This is a note to let you know that I've just added the patch titled
android: binder: use VM_ALLOC to get vm area
to my char-misc git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc.git
in the char-misc-testing branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week.)
The patch will be merged to the char-misc-next branch sometime soon,
after it passes testing, and the merge window is open.
If you have any questions about this process, please let me know.
From aac6830ec1cb681544212838911cdc57f2638216 Mon Sep 17 00:00:00 2001
From: Ganesh Mahendran <opensource.ganesh(a)gmail.com>
Date: Wed, 10 Jan 2018 10:49:05 +0800
Subject: android: binder: use VM_ALLOC to get vm area
VM_IOREMAP is used to access hardware through a mechanism called
I/O mapped memory. Android binder is an IPC mechanism which does
not access I/O memory.
VM_IOREMAP also has an alignment requirement which is not needed
by binder.
__get_vm_area_node()
{
...
if (flags & VM_IOREMAP)
align = 1ul << clamp_t(int, fls_long(size),
PAGE_SHIFT, IOREMAP_MAX_ORDER);
...
}
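To make the cost concrete, here is a small userspace sketch (not kernel
code; PAGE_SHIFT = 12 and IOREMAP_MAX_ORDER = 24 are assumed here, both
values are architecture-dependent) of the align value the snippet above
would compute for the ~1MB mapping seen in the log below:

#include <stdio.h>

/* 1-based index of the highest set bit, like the kernel's fls_long() */
static int fls_long(unsigned long x)
{
	int r = 0;

	while (x) {
		x >>= 1;
		r++;
	}
	return r;
}

int main(void)
{
	unsigned long size = 0x8cf65000UL - 0x8ce67000UL; /* 0xfe000 bytes */
	int page_shift = 12, ioremap_max_order = 24;      /* assumed values */
	int order = fls_long(size);

	/* clamp_t(int, fls_long(size), PAGE_SHIFT, IOREMAP_MAX_ORDER) */
	if (order < page_shift)
		order = page_shift;
	if (order > ioremap_max_order)
		order = ioremap_max_order;

	printf("VM_IOREMAP align: %lu bytes\n", 1UL << order);      /* 1048576 */
	printf("VM_ALLOC align:   %lu bytes\n", 1UL << page_shift); /* 4096 */
	return 0;
}

So VM_IOREMAP forces a 1MB-aligned slot for a ~1MB mapping, which
fragments the small 32-bit kernel vm area much faster than the
page-granular placement VM_ALLOC needs.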
This patch saves some kernel vm area, especially on 32-bit systems,
where the kernel vm area is only 240MB. We may get the error below
when launching an app:
<3>[ 4482.440053] binder_alloc: binder_alloc_mmap_handler: 15728 8ce67000-8cf65000 get_vm_area failed -12
<3>[ 4483.218817] binder_alloc: binder_alloc_mmap_handler: 15745 8ce67000-8cf65000 get_vm_area failed -12
Signed-off-by: Ganesh Mahendran <opensource.ganesh(a)gmail.com>
Acked-by: Martijn Coenen <maco(a)android.com>
Acked-by: Todd Kjos <tkjos(a)google.com>
Cc: stable <stable(a)vger.kernel.org>
----
V3: update comments
V2: update comments
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/android/binder_alloc.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/android/binder_alloc.c b/drivers/android/binder_alloc.c
index 07b866afee54..5a426c877dfb 100644
--- a/drivers/android/binder_alloc.c
+++ b/drivers/android/binder_alloc.c
@@ -670,7 +670,7 @@ int binder_alloc_mmap_handler(struct binder_alloc *alloc,
goto err_already_mapped;
}
- area = get_vm_area(vma->vm_end - vma->vm_start, VM_IOREMAP);
+ area = get_vm_area(vma->vm_end - vma->vm_start, VM_ALLOC);
if (area == NULL) {
ret = -ENOMEM;
failure_string = "get_vm_area";
--
2.16.1
This is a note to let you know that I've just added the patch titled
USB: serial: pl2303: new device id for Chilitag
to my usb git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb.git
in the usb-testing branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week.)
The patch will be merged to the usb-next branch sometime soon,
after it passes testing, and the merge window is open.
If you have any questions about this process, please let me know.
From d08dd3f3dd2ae351b793fc5b76abdbf0fd317b12 Mon Sep 17 00:00:00 2001
From: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Date: Thu, 25 Jan 2018 09:48:55 +0100
Subject: USB: serial: pl2303: new device id for Chilitag
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
This adds a new device id for Chilitag devices to the pl2303 driver.
Reported-by: "Chu.Mike [朱堅宜]" <Mike-Chu(a)prolific.com.tw>
Cc: stable <stable(a)vger.kernel.org>
Acked-by: Johan Hovold <johan(a)kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/usb/serial/pl2303.c | 1 +
drivers/usb/serial/pl2303.h | 1 +
2 files changed, 2 insertions(+)
diff --git a/drivers/usb/serial/pl2303.c b/drivers/usb/serial/pl2303.c
index 57ae832a49ff..46dd09da2434 100644
--- a/drivers/usb/serial/pl2303.c
+++ b/drivers/usb/serial/pl2303.c
@@ -38,6 +38,7 @@ static const struct usb_device_id id_table[] = {
{ USB_DEVICE(PL2303_VENDOR_ID, PL2303_PRODUCT_ID_RSAQ2) },
{ USB_DEVICE(PL2303_VENDOR_ID, PL2303_PRODUCT_ID_DCU11) },
{ USB_DEVICE(PL2303_VENDOR_ID, PL2303_PRODUCT_ID_RSAQ3) },
+ { USB_DEVICE(PL2303_VENDOR_ID, PL2303_PRODUCT_ID_CHILITAG) },
{ USB_DEVICE(PL2303_VENDOR_ID, PL2303_PRODUCT_ID_PHAROS) },
{ USB_DEVICE(PL2303_VENDOR_ID, PL2303_PRODUCT_ID_ALDIGA) },
{ USB_DEVICE(PL2303_VENDOR_ID, PL2303_PRODUCT_ID_MMX) },
diff --git a/drivers/usb/serial/pl2303.h b/drivers/usb/serial/pl2303.h
index f98fd84890de..fcd72396a7b6 100644
--- a/drivers/usb/serial/pl2303.h
+++ b/drivers/usb/serial/pl2303.h
@@ -12,6 +12,7 @@
#define PL2303_PRODUCT_ID_DCU11 0x1234
#define PL2303_PRODUCT_ID_PHAROS 0xaaa0
#define PL2303_PRODUCT_ID_RSAQ3 0xaaa2
+#define PL2303_PRODUCT_ID_CHILITAG 0xaaa8
#define PL2303_PRODUCT_ID_ALDIGA 0x0611
#define PL2303_PRODUCT_ID_MMX 0x0612
#define PL2303_PRODUCT_ID_GPRS 0x0609
--
2.16.1
The bounce buffer is gone from the MMC core, and now we found out
that there are some (crippled) i.MX boards out there that have broken
ADMA (cannot do scatter-gather), and also broken PIO so they must
use SDMA. Closer examination shows a less significant slowdown
also on SDMA-only capable laptop hosts.
SDMA restricts the number of segments to one, so each segment
gets turned into a single request that ping-pongs to the block
layer before the next request/segment is issued.
Apparently it happens a lot that the block layer sends requests
that include many physically discontiguous segments. My guess
is that this phenomenon is coming from the file system.
These devices that cannot handle scatterlists in hardware can see
major benefits from a DMA-contiguous bounce buffer.
This patch accumulates those fragmented scatterlists in a physically
contiguous bounce buffer so that we can issue bigger DMA data chunks
to/from the card.
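As a rough illustration (a condensed sketch with a hypothetical helper,
not the actual driver code in the diff below), the write path boils down
to packing the scatterlist into the bounce buffer and handing buffer
ownership to the device before a single SDMA transfer is programmed; the
read path does the reverse with dma_sync_single_for_cpu() and
sg_copy_from_buffer():

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>
#include <linux/mmc/core.h>

static int bounce_for_write(struct device *dev, struct mmc_data *data,
			    void *bounce_buf, dma_addr_t bounce_addr,
			    unsigned int bounce_size)
{
	unsigned int length = data->blksz * data->blocks;

	if (length > bounce_size)
		return -EIO;		/* request cannot be bounced */

	/* Gather the fragmented sglist into the contiguous buffer */
	sg_copy_to_buffer(data->sg, data->sg_len, bounce_buf, length);

	/* CPU is done writing; pass ownership of the buffer to the DMA */
	dma_sync_single_for_device(dev, bounce_addr, bounce_size,
				   DMA_TO_DEVICE);

	/* The caller now programs bounce_addr as the single SDMA address */
	return 0;
}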
When tested with this PCI-integrated host (1217:8221) that
only supports SDMA:
0b:00.0 SD Host controller: O2 Micro, Inc. OZ600FJ0/OZ900FJ0/OZ600FJS
SD/MMC Card Reader Controller (rev 05)
This patch gave a ~1MByte/s throughput improvement on large reads and
writes when testing with iozone, compared to the kernel without the patch.
dmesg:
sdhci-pci 0000:0b:00.0: SDHCI controller found [1217:8221] (rev 5)
mmc0 bounce up to 128 segments into one, max segment size 65536 bytes
mmc0: SDHCI controller on PCI [0000:0b:00.0] using DMA
On the i.MX SDHCI controllers on the crippled i.MX 25 and i.MX 35
the patch restores the performance to what it was before we removed
the bounce buffers, and then some: performance is better than ever
because we now allocate a bounce buffer the size of the maximum
single request the SDMA engine can handle. On the PCI laptop this
is 256K, whereas with the old bounce buffer code it was 64K max.
Cc: Pierre Ossman <pierre(a)ossman.eu>
Cc: Benoît Thébaudeau <benoit(a)wsystem.com>
Cc: Fabio Estevam <fabio.estevam(a)nxp.com>
Cc: Benjamin Beckmeyer <beckmeyer.b(a)rittal.de>
Cc: stable(a)vger.kernel.org # v4.14+
Fixes: de3ee99b097d ("mmc: Delete bounce buffer handling")
Tested-by: Benjamin Beckmeyer <beckmeyer.b(a)rittal.de>
Signed-off-by: Linus Walleij <linus.walleij(a)linaro.org>
---
ChangeLog v6->v7:
- Fix the directions on dma_sync_for[device|cpu]() so the
ownership of the buffer gets swapped properly and in the right
direction for every transfer. Didn't see this because x86 PCI is
DMA coherent...
- Tested and greenlighted on i.MX 25.
- Also tested on the PCI version.
ChangeLog v5->v6:
- Again switch back to explicit sync of buffers. I want to get this
solution to work because it gives more control and it's more
elegant.
- Update host->max_req_size as noted by Adrian, hopefully this
fixes the i.MX. I was just lucky on my Intel laptop I guess:
the block stack never requested anything bigger than 64KB and
that was why it worked even if max_req_size was bigger than
what would fit in the bounce buffer.
- Copy the number of bytes in the mmc_data instead of the number
of bytes in the bounce buffer. For RX this is blksize * blocks
and for TX this is bytes_xfered.
- Break out a sdhci_sdma_address() for getting the DMA address
for either the raw sglist or the bounce buffer depending on
configuration.
- Add some explicit bounds check for the data so that we do not
attempt to copy more than the bounce buffer size even if the
block layer is erroneously configured.
- Move allocation of bounce buffer out to its own function.
- Use pr_[info|err] throughout so all debug prints from the
driver come out in the same manner and style.
- Use unsigned int for the bounce buffer size.
- Re-tested with iozone: we still get the same nice performance
improvements.
- Request a test on i.MX (hi Benjamin)
ChangeLog v4->v5:
- Go back to dma_alloc_coherent() as this apparently works better.
- Keep the other changes, cap for 64KB, fall back to single segments.
- Requesting a test of this on i.MX. (Sorry Benjamin.)
ChangeLog v3->v4:
- Cap the bounce buffer to 64KB instead of the biggest segment
as we experience diminishing returns with buffers > 64KB.
- Instead of using dma_alloc_coherent(), use good old devm_kmalloc()
and issue dma_sync_single_for*() to explicitly switch
ownership between CPU and the device. This way we exercise the
cache better and may consume less CPU.
- Bail out with single segments if we cannot allocate a bounce
buffer.
- Tested on the PCI SDHCI on my laptop: requesting a new test
on i.MX from Benjamin. (Please!)
ChangeLog v2->v3:
- Rewrite the commit message a bit
- Add Benjamin's Tested-by
- Add Fixes and stable tags
ChangeLog v1->v2:
- Skip the remapping and fiddling with the buffer, instead use
dma_alloc_coherent() and use a simple, coherent bounce buffer.
- Couple kernel messages to ->parent of the mmc_host as it relates
to the hardware characteristics.
---
drivers/mmc/host/sdhci.c | 169 ++++++++++++++++++++++++++++++++++++++++++++---
drivers/mmc/host/sdhci.h | 3 +
2 files changed, 164 insertions(+), 8 deletions(-)
diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c
index e9290a3439d5..51d670b8104d 100644
--- a/drivers/mmc/host/sdhci.c
+++ b/drivers/mmc/host/sdhci.c
@@ -21,6 +21,7 @@
#include <linux/dma-mapping.h>
#include <linux/slab.h>
#include <linux/scatterlist.h>
+#include <linux/sizes.h>
#include <linux/swiotlb.h>
#include <linux/regulator/consumer.h>
#include <linux/pm_runtime.h>
@@ -502,8 +503,35 @@ static int sdhci_pre_dma_transfer(struct sdhci_host *host,
if (data->host_cookie == COOKIE_PRE_MAPPED)
return data->sg_count;
- sg_count = dma_map_sg(mmc_dev(host->mmc), data->sg, data->sg_len,
- mmc_get_dma_dir(data));
+ /* Bounce write requests to the bounce buffer */
+ if (host->bounce_buffer) {
+ unsigned int length = data->blksz * data->blocks;
+
+ if (length > host->bounce_buffer_size) {
+ pr_err("%s: asked for transfer of %u bytes exceeds bounce buffer %u bytes\n",
+ mmc_hostname(host->mmc), length,
+ host->bounce_buffer_size);
+ return -EIO;
+ }
+ if (mmc_get_dma_dir(data) == DMA_TO_DEVICE) {
+ /* Copy the data to the bounce buffer */
+ sg_copy_to_buffer(data->sg, data->sg_len,
+ host->bounce_buffer,
+ length);
+ }
+ /* Switch ownership to the DMA */
+ dma_sync_single_for_device(host->mmc->parent,
+ host->bounce_addr,
+ host->bounce_buffer_size,
+ mmc_get_dma_dir(data));
+ /* Just a dummy value */
+ sg_count = 1;
+ } else {
+ /* Just access the data directly from memory */
+ sg_count = dma_map_sg(mmc_dev(host->mmc),
+ data->sg, data->sg_len,
+ mmc_get_dma_dir(data));
+ }
if (sg_count == 0)
return -ENOSPC;
@@ -858,8 +886,13 @@ static void sdhci_prepare_data(struct sdhci_host *host, struct mmc_command *cmd)
SDHCI_ADMA_ADDRESS_HI);
} else {
WARN_ON(sg_cnt != 1);
- sdhci_writel(host, sg_dma_address(data->sg),
- SDHCI_DMA_ADDRESS);
+ /* Bounce buffer goes to work */
+ if (host->bounce_buffer)
+ sdhci_writel(host, host->bounce_addr,
+ SDHCI_DMA_ADDRESS);
+ else
+ sdhci_writel(host, sg_dma_address(data->sg),
+ SDHCI_DMA_ADDRESS);
}
}
@@ -2248,7 +2281,12 @@ static void sdhci_pre_req(struct mmc_host *mmc, struct mmc_request *mrq)
mrq->data->host_cookie = COOKIE_UNMAPPED;
- if (host->flags & SDHCI_REQ_USE_DMA)
+ /*
+ * No pre-mapping in the pre hook if we're using the bounce buffer,
+ * for that we would need two bounce buffers since one buffer is
+ * in flight when this is getting called.
+ */
+ if (host->flags & SDHCI_REQ_USE_DMA && !host->bounce_buffer)
sdhci_pre_dma_transfer(host, mrq->data, COOKIE_PRE_MAPPED);
}
@@ -2352,8 +2390,45 @@ static bool sdhci_request_done(struct sdhci_host *host)
struct mmc_data *data = mrq->data;
if (data && data->host_cookie == COOKIE_MAPPED) {
- dma_unmap_sg(mmc_dev(host->mmc), data->sg, data->sg_len,
- mmc_get_dma_dir(data));
+ if (host->bounce_buffer) {
+ /*
+ * On reads, copy the bounced data into the
+ * sglist
+ */
+ if (mmc_get_dma_dir(data) == DMA_FROM_DEVICE) {
+ unsigned int length = data->bytes_xfered;
+
+ if (length > host->bounce_buffer_size) {
+ pr_err("%s: bounce buffer is %u bytes but DMA claims to have transferred %u bytes\n",
+ mmc_hostname(host->mmc),
+ host->bounce_buffer_size,
+ data->bytes_xfered);
+ /* Cap it down and continue */
+ length = host->bounce_buffer_size;
+ }
+ dma_sync_single_for_cpu(
+ host->mmc->parent,
+ host->bounce_addr,
+ host->bounce_buffer_size,
+ DMA_FROM_DEVICE);
+ sg_copy_from_buffer(data->sg,
+ data->sg_len,
+ host->bounce_buffer,
+ length);
+ } else {
+ /* No copying, just switch ownership */
+ dma_sync_single_for_cpu(
+ host->mmc->parent,
+ host->bounce_addr,
+ host->bounce_buffer_size,
+ mmc_get_dma_dir(data));
+ }
+ } else {
+ /* Unmap the raw data */
+ dma_unmap_sg(mmc_dev(host->mmc), data->sg,
+ data->sg_len,
+ mmc_get_dma_dir(data));
+ }
data->host_cookie = COOKIE_UNMAPPED;
}
}
@@ -2543,6 +2618,14 @@ static void sdhci_adma_show_error(struct sdhci_host *host)
}
}
+static u32 sdhci_sdma_address(struct sdhci_host *host)
+{
+ if (host->bounce_buffer)
+ return host->bounce_addr;
+ else
+ return sg_dma_address(host->data->sg);
+}
+
static void sdhci_data_irq(struct sdhci_host *host, u32 intmask)
{
u32 command;
@@ -2636,7 +2719,8 @@ static void sdhci_data_irq(struct sdhci_host *host, u32 intmask)
*/
if (intmask & SDHCI_INT_DMA_END) {
u32 dmastart, dmanow;
- dmastart = sg_dma_address(host->data->sg);
+
+ dmastart = sdhci_sdma_address(host);
dmanow = dmastart + host->data->bytes_xfered;
/*
* Force update to the next DMA block boundary.
@@ -3217,6 +3301,68 @@ void __sdhci_read_caps(struct sdhci_host *host, u16 *ver, u32 *caps, u32 *caps1)
}
EXPORT_SYMBOL_GPL(__sdhci_read_caps);
+static int sdhci_allocate_bounce_buffer(struct sdhci_host *host)
+{
+ struct mmc_host *mmc = host->mmc;
+ unsigned int max_blocks;
+ unsigned int bounce_size;
+ int ret;
+
+ /*
+ * Cap the bounce buffer at 64KB. Using a bigger bounce buffer
+ * has diminishing returns, this is probably because SD/MMC
+ * cards are usually optimized to handle this size of requests.
+ */
+ bounce_size = SZ_64K;
+ /*
+ * Adjust downwards to maximum request size if this is less
+ * than our segment size, else hammer down the maximum
+ * request size to the maximum buffer size.
+ */
+ if (mmc->max_req_size < bounce_size)
+ bounce_size = mmc->max_req_size;
+ max_blocks = bounce_size / 512;
+
+ /*
+ * When we just support one segment, we can get significant
+ * speedups by the help of a bounce buffer to group scattered
+ * reads/writes together.
+ */
+ host->bounce_buffer = devm_kmalloc(mmc->parent,
+ bounce_size,
+ GFP_KERNEL);
+ if (!host->bounce_buffer) {
+ pr_err("%s: failed to allocate %u bytes for bounce buffer, falling back to single segments\n",
+ mmc_hostname(mmc),
+ bounce_size);
+ /*
+ * Exiting with zero here makes sure we proceed with
+ * mmc->max_segs == 1.
+ */
+ return 0;
+ }
+
+ host->bounce_addr = dma_map_single(mmc->parent,
+ host->bounce_buffer,
+ bounce_size,
+ DMA_BIDIRECTIONAL);
+ ret = dma_mapping_error(mmc->parent, host->bounce_addr);
+ if (ret)
+ /* Again fall back to max_segs == 1 */
+ return 0;
+ host->bounce_buffer_size = bounce_size;
+
+ /* Lie about this since we're bouncing */
+ mmc->max_segs = max_blocks;
+ mmc->max_seg_size = bounce_size;
+ mmc->max_req_size = bounce_size;
+
+ pr_info("%s bounce up to %u segments into one, max segment size %u bytes\n",
+ mmc_hostname(mmc), max_blocks, bounce_size);
+
+ return 0;
+}
+
int sdhci_setup_host(struct sdhci_host *host)
{
struct mmc_host *mmc;
@@ -3713,6 +3859,13 @@ int sdhci_setup_host(struct sdhci_host *host)
*/
mmc->max_blk_count = (host->quirks & SDHCI_QUIRK_NO_MULTIBLOCK) ? 1 : 65535;
+ if (mmc->max_segs == 1) {
+ /* This may alter mmc->*_blk_* parameters */
+ ret = sdhci_allocate_bounce_buffer(host);
+ if (ret)
+ return ret;
+ }
+
return 0;
unreg:
diff --git a/drivers/mmc/host/sdhci.h b/drivers/mmc/host/sdhci.h
index 54bc444c317f..1d7d61e25dbf 100644
--- a/drivers/mmc/host/sdhci.h
+++ b/drivers/mmc/host/sdhci.h
@@ -440,6 +440,9 @@ struct sdhci_host {
int irq; /* Device IRQ */
void __iomem *ioaddr; /* Mapped address */
+ char *bounce_buffer; /* For packing SDMA reads/writes */
+ dma_addr_t bounce_addr;
+ unsigned int bounce_buffer_size;
const struct sdhci_ops *ops; /* Low level hw interface */
--
2.14.3
In commit 2bfa0719ac2a ("usb: gadget: function: f_fs: pass
companion descriptor along") there is a pointer arithmetic
bug where the comp_desc is obtained as follows:
comp_desc = (struct usb_ss_ep_comp_descriptor *)(ds +
USB_DT_ENDPOINT_SIZE);
Since ds is a pointer to usb_endpoint_descriptor, adding
7 to it advances by 7 * sizeof(struct usb_endpoint_descriptor),
i.e. 7 * 9 bytes, ending up well out of bounds past the SS
descriptor. As a result the maxburst value will be
read incorrectly, and the UDC driver will also get a garbage
comp_desc (assuming it uses it).
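A minimal userspace sketch of the arithmetic (made-up names, only to
illustrate the pointer scaling; the actual fix below drops this code
entirely in favor of config_ep_by_speed()):

#include <stdio.h>

/* stand-in with the same 9-byte size as the real descriptor struct */
struct usb_endpoint_descriptor { unsigned char bytes[9]; };
#define USB_DT_ENDPOINT_SIZE 7

int main(void)
{
	unsigned char raw[64] = { 0 };
	struct usb_endpoint_descriptor *ds =
		(struct usb_endpoint_descriptor *)raw;

	/* buggy: pointer arithmetic scales by sizeof(*ds), 7 * 9 = 63 bytes */
	unsigned char *wrong = (unsigned char *)(ds + USB_DT_ENDPOINT_SIZE);
	/* byte-accurate offset to where the companion descriptor follows */
	unsigned char *right = (unsigned char *)ds + USB_DT_ENDPOINT_SIZE;

	printf("wrong offset: %td bytes\n", wrong - raw); /* 63 */
	printf("right offset: %td bytes\n", right - raw); /*  7 */
	return 0;
}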
Since Felipe wrote, "Eventually, f_fs.c should be converted
to use config_ep_by_speed() like all other functions, though",
let's finally do it. This allows the other usb_ep fields to
be properly populated, such as maxpacket and mult. It also
eliminates the awkward speed-based descriptor lookup since
config_ep_by_speed() does that already using the ones found
in struct usb_function.
Fixes: 2bfa0719ac2a ("usb: gadget: function: f_fs: pass companion descriptor along")
Cc: stable(a)vger.kernel.org
Signed-off-by: Jack Pham <jackp(a)codeaurora.org>
---
drivers/usb/gadget/function/f_fs.c | 38 +++++++-------------------------------
1 file changed, 7 insertions(+), 31 deletions(-)
diff --git a/drivers/usb/gadget/function/f_fs.c b/drivers/usb/gadget/function/f_fs.c
index 5f2dafb5..717b2de 100644
--- a/drivers/usb/gadget/function/f_fs.c
+++ b/drivers/usb/gadget/function/f_fs.c
@@ -1852,44 +1852,20 @@ static int ffs_func_eps_enable(struct ffs_function *func)
spin_lock_irqsave(&func->ffs->eps_lock, flags);
while(count--) {
- struct usb_endpoint_descriptor *ds;
- struct usb_ss_ep_comp_descriptor *comp_desc = NULL;
- int needs_comp_desc = false;
- int desc_idx;
-
- if (ffs->gadget->speed == USB_SPEED_SUPER) {
- desc_idx = 2;
- needs_comp_desc = true;
- } else if (ffs->gadget->speed == USB_SPEED_HIGH)
- desc_idx = 1;
- else
- desc_idx = 0;
-
- /* fall-back to lower speed if desc missing for current speed */
- do {
- ds = ep->descs[desc_idx];
- } while (!ds && --desc_idx >= 0);
-
- if (!ds) {
- ret = -EINVAL;
- break;
- }
-
ep->ep->driver_data = ep;
- ep->ep->desc = ds;
- if (needs_comp_desc) {
- comp_desc = (struct usb_ss_ep_comp_descriptor *)(ds +
- USB_DT_ENDPOINT_SIZE);
- ep->ep->maxburst = comp_desc->bMaxBurst + 1;
- ep->ep->comp_desc = comp_desc;
+ ret = config_ep_by_speed(func->gadget, &func->function, ep->ep);
+ if (ret) {
+ pr_err("%s: config_ep_by_speed(%s) returned %d\n",
+ __func__, ep->ep->name, ret);
+ break;
}
ret = usb_ep_enable(ep->ep);
if (likely(!ret)) {
epfile->ep = ep;
- epfile->in = usb_endpoint_dir_in(ds);
- epfile->isoc = usb_endpoint_xfer_isoc(ds);
+ epfile->in = usb_endpoint_dir_in(ep->ep->desc);
+ epfile->isoc = usb_endpoint_xfer_isoc(ep->ep->desc);
} else {
break;
}
--
2.9.1.200.gb1ec08f
On Mon, Jan 22, 2018 at 6:59 PM, Zhang, Ning A <ning.a.zhang(a)intel.com> wrote:
> hello, Greg, Andy, Thomas
>
> would you like to backport these two patches to LTS kernel?
>
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/com…
>
> x86/asm/32: Make sync_core() handle missing CPUID on all 32-bit kernels
>
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/com…
>
> x86/asm: Rewrite sync_core() to use IRET-to-self
>
I'd be in favor of backporting
1c52d859cb2d417e7216d3e56bb7fea88444cec9. I see no compelling reason
to backport the other one, since it doesn't fix a bug. Greg, can you
do this?
Please add this patch to stable 4.9
It is already present in 4.14, and it is not applicable to earlier versions.
------------------------
commit c73322d098e4b6f5f0f0fa1330bf57e218775539
Author: Johannes Weiner <hannes(a)cmpxchg.org>
Date: Wed May 3 14:51:51 2017 -0700
mm: fix 100% CPU kswapd busyloop on unreclaimable nodes
-------------------------
This is a note to let you know that I've just added the patch titled
mm: fix 100% CPU kswapd busyloop on unreclaimable nodes
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
mm-fix-100-cpu-kswapd-busyloop-on-unreclaimable-nodes.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
From c73322d098e4b6f5f0f0fa1330bf57e218775539 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes(a)cmpxchg.org>
Date: Wed, 3 May 2017 14:51:51 -0700
Subject: mm: fix 100% CPU kswapd busyloop on unreclaimable nodes
From: Johannes Weiner <hannes(a)cmpxchg.org>
commit c73322d098e4b6f5f0f0fa1330bf57e218775539 upstream.
Patch series "mm: kswapd spinning on unreclaimable nodes - fixes and
cleanups".
Jia reported a scenario in which the kswapd of a node indefinitely spins
at 100% CPU usage. We have seen similar cases at Facebook.
The kernel's current method of judging its ability to reclaim a node (or
whether to back off and sleep) is based on the amount of scanned pages
in proportion to the amount of reclaimable pages. In Jia's and our
scenarios, there are no reclaimable pages in the node, however, and the
condition for backing off is never met. Kswapd busyloops in an attempt
to restore the watermarks while having nothing to work with.
This series reworks the definition of an unreclaimable node based not on
scanning but on whether kswapd is able to actually reclaim pages in
MAX_RECLAIM_RETRIES (16) consecutive runs. This is the same criteria
the page allocator uses for giving up on direct reclaim and invoking the
OOM killer. If it cannot free any pages, kswapd will go to sleep and
leave further attempts to direct reclaim invocations, which will either
make progress and re-enable kswapd, or invoke the OOM killer.
Patch #1 fixes the immediate problem Jia reported, the remainder are
smaller fixlets, cleanups, and overall phasing out of the old method.
Patch #6 is the odd one out. It's a nice cleanup to get_scan_count(),
and directly related to #5, but in itself not relevant to the series.
If the whole series is too ambitious for 4.11, I would consider the
first three patches fixes, the rest cleanups.
This patch (of 9):
Jia He reports a problem with kswapd spinning at 100% CPU when
requesting more hugepages than memory available in the system:
$ echo 4000 >/proc/sys/vm/nr_hugepages
top - 13:42:59 up 3:37, 1 user, load average: 1.09, 1.03, 1.01
Tasks: 1 total, 1 running, 0 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 12.5 sy, 0.0 ni, 85.5 id, 2.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 31371520 total, 30915136 used, 456384 free, 320 buffers
KiB Swap: 6284224 total, 115712 used, 6168512 free. 48192 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
76 root 20 0 0 0 0 R 100.0 0.000 217:17.29 kswapd3
At that time, there are no reclaimable pages left in the node, but as
kswapd fails to restore the high watermarks it refuses to go to sleep.
Kswapd needs to back away from nodes that fail to balance. Up until
commit 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of
nodes") kswapd had such a mechanism. It considered zones whose
theoretically reclaimable pages it had reclaimed six times over as
unreclaimable and backed away from them. This guard was erroneously
removed as the patch changed the definition of a balanced node.
However, simply restoring this code wouldn't help in the case reported
here: there *are* no reclaimable pages that could be scanned until the
threshold is met. Kswapd would stay awake anyway.
Introduce a new and much simpler way of backing off. If kswapd runs
through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
page, make it back off from the node. This is the same number of shots
direct reclaim takes before declaring OOM. Kswapd will go to sleep on
that node until a direct reclaimer manages to reclaim some pages, thus
proving the node reclaimable again.
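Schematically (a hedged sketch with made-up names, not the actual kernel
code; the real changes are in the diff below), the back-off boils down to
a per-node counter of consecutive zero-progress kswapd runs:

#define MAX_RECLAIM_RETRIES 16	/* same threshold direct reclaim uses */

struct node_state {
	int kswapd_failures;	/* consecutive 'reclaimed == 0' runs */
};

/* Called after each balancing run with the pages reclaimed in that run */
static void kswapd_account_run(struct node_state *node,
			       unsigned long nr_reclaimed)
{
	if (nr_reclaimed)
		node->kswapd_failures = 0;	/* progress revives the node */
	else
		node->kswapd_failures++;
}

/*
 * Wakeups and further balancing are skipped once the node looks hopeless;
 * a successful direct reclaim run resets the counter and revives kswapd.
 */
static int node_is_hopeless(const struct node_state *node)
{
	return node->kswapd_failures >= MAX_RECLAIM_RETRIES;
}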
[hannes(a)cmpxchg.org: check kswapd failure against the cumulative nr_reclaimed count]
Link: http://lkml.kernel.org/r/20170306162410.GB2090@cmpxchg.org
[shakeelb(a)google.com: fix condition for throttle_direct_reclaim]
Link: http://lkml.kernel.org/r/20170314183228.20152-1-shakeelb@google.com
Link: http://lkml.kernel.org/r/20170228214007.5621-2-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes(a)cmpxchg.org>
Signed-off-by: Shakeel Butt <shakeelb(a)google.com>
Reported-by: Jia He <hejianet(a)gmail.com>
Tested-by: Jia He <hejianet(a)gmail.com>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Acked-by: Hillf Danton <hillf.zj(a)alibaba-inc.com>
Acked-by: Minchan Kim <minchan(a)kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Dmitry Shmidt <dimitrysh(a)google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
include/linux/mmzone.h | 2 ++
mm/internal.h | 6 ++++++
mm/page_alloc.c | 9 ++-------
mm/vmscan.c | 47 ++++++++++++++++++++++++++++++++---------------
mm/vmstat.c | 2 +-
5 files changed, 43 insertions(+), 23 deletions(-)
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -633,6 +633,8 @@ typedef struct pglist_data {
int kswapd_order;
enum zone_type kswapd_classzone_idx;
+ int kswapd_failures; /* Number of 'reclaimed == 0' runs */
+
#ifdef CONFIG_COMPACTION
int kcompactd_max_order;
enum zone_type kcompactd_classzone_idx;
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -74,6 +74,12 @@ static inline void set_page_refcounted(s
extern unsigned long highest_memmap_pfn;
/*
+ * Maximum number of reclaim retries without progress before the OOM
+ * killer is consider the only way forward.
+ */
+#define MAX_RECLAIM_RETRIES 16
+
+/*
* in mm/vmscan.c:
*/
extern int isolate_lru_page(struct page *page);
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3422,12 +3422,6 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_ma
}
/*
- * Maximum number of reclaim retries without any progress before OOM killer
- * is consider as the only way to move forward.
- */
-#define MAX_RECLAIM_RETRIES 16
-
-/*
* Checks whether it makes sense to retry the reclaim to make a forward progress
* for the given allocation request.
* The reclaim feedback represented by did_some_progress (any progress during
@@ -4385,7 +4379,8 @@ void show_free_areas(unsigned int filter
K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
K(node_page_state(pgdat, NR_UNSTABLE_NFS)),
node_page_state(pgdat, NR_PAGES_SCANNED),
- !pgdat_reclaimable(pgdat) ? "yes" : "no");
+ pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ?
+ "yes" : "no");
}
for_each_populated_zone(zone) {
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2606,6 +2606,15 @@ static bool shrink_node(pg_data_t *pgdat
} while (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
sc->nr_scanned - nr_scanned, sc));
+ /*
+ * Kswapd gives up on balancing particular nodes after too
+ * many failures to reclaim anything from them and goes to
+ * sleep. On reclaim progress, reset the failure counter. A
+ * successful direct reclaim run will revive a dormant kswapd.
+ */
+ if (reclaimable)
+ pgdat->kswapd_failures = 0;
+
return reclaimable;
}
@@ -2680,10 +2689,6 @@ static void shrink_zones(struct zonelist
GFP_KERNEL | __GFP_HARDWALL))
continue;
- if (sc->priority != DEF_PRIORITY &&
- !pgdat_reclaimable(zone->zone_pgdat))
- continue; /* Let kswapd poll it */
-
/*
* If we already have plenty of memory free for
* compaction in this zone, don't free any more.
@@ -2820,7 +2825,7 @@ retry:
return 0;
}
-static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
+static bool allow_direct_reclaim(pg_data_t *pgdat)
{
struct zone *zone;
unsigned long pfmemalloc_reserve = 0;
@@ -2828,6 +2833,9 @@ static bool pfmemalloc_watermark_ok(pg_d
int i;
bool wmark_ok;
+ if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
+ return true;
+
for (i = 0; i <= ZONE_NORMAL; i++) {
zone = &pgdat->node_zones[i];
if (!managed_zone(zone) ||
@@ -2908,7 +2916,7 @@ static bool throttle_direct_reclaim(gfp_
/* Throttle based on the first usable node */
pgdat = zone->zone_pgdat;
- if (pfmemalloc_watermark_ok(pgdat))
+ if (allow_direct_reclaim(pgdat))
goto out;
break;
}
@@ -2930,14 +2938,14 @@ static bool throttle_direct_reclaim(gfp_
*/
if (!(gfp_mask & __GFP_FS)) {
wait_event_interruptible_timeout(pgdat->pfmemalloc_wait,
- pfmemalloc_watermark_ok(pgdat), HZ);
+ allow_direct_reclaim(pgdat), HZ);
goto check_pending;
}
/* Throttle until kswapd wakes the process */
wait_event_killable(zone->zone_pgdat->pfmemalloc_wait,
- pfmemalloc_watermark_ok(pgdat));
+ allow_direct_reclaim(pgdat));
check_pending:
if (fatal_signal_pending(current))
@@ -3116,7 +3124,7 @@ static bool prepare_kswapd_sleep(pg_data
/*
* The throttled processes are normally woken up in balance_pgdat() as
- * soon as pfmemalloc_watermark_ok() is true. But there is a potential
+ * soon as allow_direct_reclaim() is true. But there is a potential
* race between when kswapd checks the watermarks and a process gets
* throttled. There is also a potential race if processes get
* throttled, kswapd wakes, a large process exits thereby balancing the
@@ -3130,6 +3138,10 @@ static bool prepare_kswapd_sleep(pg_data
if (waitqueue_active(&pgdat->pfmemalloc_wait))
wake_up_all(&pgdat->pfmemalloc_wait);
+ /* Hopeless node, leave it to direct reclaim */
+ if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
+ return true;
+
for (i = 0; i <= classzone_idx; i++) {
struct zone *zone = pgdat->node_zones + i;
@@ -3216,9 +3228,9 @@ static int balance_pgdat(pg_data_t *pgda
count_vm_event(PAGEOUTRUN);
do {
+ unsigned long nr_reclaimed = sc.nr_reclaimed;
bool raise_priority = true;
- sc.nr_reclaimed = 0;
sc.reclaim_idx = classzone_idx;
/*
@@ -3297,7 +3309,7 @@ static int balance_pgdat(pg_data_t *pgda
* able to safely make forward progress. Wake them
*/
if (waitqueue_active(&pgdat->pfmemalloc_wait) &&
- pfmemalloc_watermark_ok(pgdat))
+ allow_direct_reclaim(pgdat))
wake_up_all(&pgdat->pfmemalloc_wait);
/* Check if kswapd should be suspending */
@@ -3308,10 +3320,14 @@ static int balance_pgdat(pg_data_t *pgda
* Raise priority if scanning rate is too low or there was no
* progress in reclaiming pages
*/
- if (raise_priority || !sc.nr_reclaimed)
+ nr_reclaimed = sc.nr_reclaimed - nr_reclaimed;
+ if (raise_priority || !nr_reclaimed)
sc.priority--;
} while (sc.priority >= 1);
+ if (!sc.nr_reclaimed)
+ pgdat->kswapd_failures++;
+
out:
/*
* Return the order kswapd stopped reclaiming at as
@@ -3511,6 +3527,10 @@ void wakeup_kswapd(struct zone *zone, in
if (!waitqueue_active(&pgdat->kswapd_wait))
return;
+ /* Hopeless node, leave it to direct reclaim */
+ if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
+ return;
+
/* Only wake kswapd if all zones are unbalanced */
for (z = 0; z <= classzone_idx; z++) {
zone = pgdat->node_zones + z;
@@ -3781,9 +3801,6 @@ int node_reclaim(struct pglist_data *pgd
sum_zone_node_page_state(pgdat->node_id, NR_SLAB_RECLAIMABLE) <= pgdat->min_slab_pages)
return NODE_RECLAIM_FULL;
- if (!pgdat_reclaimable(pgdat))
- return NODE_RECLAIM_FULL;
-
/*
* Do not scan if the allocation should not be delayed.
*/
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1421,7 +1421,7 @@ static void zoneinfo_show_print(struct s
"\n node_unreclaimable: %u"
"\n start_pfn: %lu"
"\n node_inactive_ratio: %u",
- !pgdat_reclaimable(zone->zone_pgdat),
+ pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES,
zone->zone_start_pfn,
zone->zone_pgdat->inactive_ratio);
seq_putc(m, '\n');
Patches currently in stable-queue which might be from hannes(a)cmpxchg.org are
queue-4.9/mm-fix-100-cpu-kswapd-busyloop-on-unreclaimable-nodes.patch
queue-4.9/mm-page_alloc-fix-potential-false-positive-in-__zone_watermark_ok.patch
Hi Sasha,
The build logs for linux-4.1 on ARM64 have become completely unreadable
after Olof updated his toolchain, with many thousands of warnings like:
> arm64.allmodconfig:
> /tmp/cc1QjtsZ.s:2500: Warning: ignoring incorrect section type for .init_array.00100
> /tmp/cc1QjtsZ.s:2513: Warning: ignoring incorrect section type for .fini_array.00100
> /tmp/cc3sXmLs.s:77: Warning: ignoring incorrect section type for .init_array.00100
> /tmp/cc3sXmLs.s:90: Warning: ignoring incorrect section type for .fini_array.00100
> /tmp/cc3J6cMe.s:1401: Warning: ignoring incorrect section type for .init_array.00100
> /tmp/cc3J6cMe.s:1414: Warning: ignoring incorrect section type for .fini_array.00100
> /tmp/ccbjphUv.s:1775: Warning: ignoring incorrect section type for .init_array.00100
> /tmp/ccbjphUv.s:1788: Warning: ignoring incorrect section type for .fini_array.00100
> /tmp/ccVPrJUF.s:50: Warning: ignoring incorrect section type for .init_array.00100
Could you pick up commit
cc622420798c ("gcov: disable for COMPILE_TEST")
like Greg did for 3.18 and 4.4?
Many configurations also produce
drivers/mtd/mtd_blkdevs.c:100:2: warning: switch condition has boolean
value [-Wswitch-bool]
which got fixed upstream with
cc7fce802290 ("mtd: blkdevs: fix switch-bool compilation warning")
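For context, -Wswitch-bool fires on code shaped like this made-up
example (not the mtd_blkdevs.c code itself):

#include <stdbool.h>

int classify(bool ok)
{
	switch (ok) {	/* warning: switch condition has boolean value */
	case true:
		return 1;
	default:
		return 0;
	}
}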
Once those two are backported, the log files like
http://arm-soc.lixom.net/buildlogs/stable-rc/v4.1.49/
become much more useful.
Thanks,
Arnd