The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 66e839030fd698586734e017fd55c4f2a89dba0b Mon Sep 17 00:00:00 2001
From: Matt Chen <matt.chen(a)intel.com>
Date: Fri, 3 Aug 2018 14:29:20 +0800
Subject: [PATCH] iwlwifi: fix wrong WGDS_WIFI_DATA_SIZE
>From coreboot/BIOS:
Name ("WGDS", Package() {
Revision,
Package() {
DomainType, // 0x7:WiFi ==> We miss this one.
WgdsWiFiSarDeltaGroup1PowerMax1, // Group 1 FCC 2400 Max
WgdsWiFiSarDeltaGroup1PowerChainA1, // Group 1 FCC 2400 A Offset
WgdsWiFiSarDeltaGroup1PowerChainB1, // Group 1 FCC 2400 B Offset
WgdsWiFiSarDeltaGroup1PowerMax2, // Group 1 FCC 5200 Max
WgdsWiFiSarDeltaGroup1PowerChainA2, // Group 1 FCC 5200 A Offset
WgdsWiFiSarDeltaGroup1PowerChainB2, // Group 1 FCC 5200 B Offset
WgdsWiFiSarDeltaGroup2PowerMax1, // Group 2 EC Jap 2400 Max
WgdsWiFiSarDeltaGroup2PowerChainA1, // Group 2 EC Jap 2400 A Offset
WgdsWiFiSarDeltaGroup2PowerChainB1, // Group 2 EC Jap 2400 B Offset
WgdsWiFiSarDeltaGroup2PowerMax2, // Group 2 EC Jap 5200 Max
WgdsWiFiSarDeltaGroup2PowerChainA2, // Group 2 EC Jap 5200 A Offset
WgdsWiFiSarDeltaGroup2PowerChainB2, // Group 2 EC Jap 5200 B Offset
WgdsWiFiSarDeltaGroup3PowerMax1, // Group 3 ROW 2400 Max
WgdsWiFiSarDeltaGroup3PowerChainA1, // Group 3 ROW 2400 A Offset
WgdsWiFiSarDeltaGroup3PowerChainB1, // Group 3 ROW 2400 B Offset
WgdsWiFiSarDeltaGroup3PowerMax2, // Group 3 ROW 5200 Max
WgdsWiFiSarDeltaGroup3PowerChainA2, // Group 3 ROW 5200 A Offset
WgdsWiFiSarDeltaGroup3PowerChainB2, // Group 3 ROW 5200 B Offset
}
})
When read the ACPI data to find out the WGDS, the DATA_SIZE is never
matched.
>From the above format, it gives 19 numbers, but our driver is hardcode
as 18.
Fix it to pass then can parse the data into our wgds table.
Then we will see:
iwlwifi 0000:01:00.0: U iwl_mvm_sar_geo_init Sending GEO_TX_POWER_LIMIT
iwlwifi 0000:01:00.0: U iwl_mvm_sar_geo_init SAR geographic profile[0]
Band[0]: chain A = 68 chain B = 69 max_tx_power = 54
iwlwifi 0000:01:00.0: U iwl_mvm_sar_geo_init SAR geographic profile[0]
Band[1]: chain A = 48 chain B = 49 max_tx_power = 70
iwlwifi 0000:01:00.0: U iwl_mvm_sar_geo_init SAR geographic profile[1]
Band[0]: chain A = 51 chain B = 67 max_tx_power = 50
iwlwifi 0000:01:00.0: U iwl_mvm_sar_geo_init SAR geographic profile[1]
Band[1]: chain A = 69 chain B = 70 max_tx_power = 68
iwlwifi 0000:01:00.0: U iwl_mvm_sar_geo_init SAR geographic profile[2]
Band[0]: chain A = 49 chain B = 50 max_tx_power = 48
iwlwifi 0000:01:00.0: U iwl_mvm_sar_geo_init SAR geographic profile[2]
Band[1]: chain A = 52 chain B = 53 max_tx_power = 51
Cc: stable(a)vger.kernel.org # 4.12+
Fixes: a6bff3cb19b7 ("iwlwifi: mvm: add GEO_TX_POWER_LIMIT cmd for geographic tx power table")
Signed-off-by: Matt Chen <matt.chen(a)intel.com>
Signed-off-by: Luca Coelho <luciano.coelho(a)intel.com>
diff --git a/drivers/net/wireless/intel/iwlwifi/fw/acpi.h b/drivers/net/wireless/intel/iwlwifi/fw/acpi.h
index 2439e98431ee..7492dfb6729b 100644
--- a/drivers/net/wireless/intel/iwlwifi/fw/acpi.h
+++ b/drivers/net/wireless/intel/iwlwifi/fw/acpi.h
@@ -6,6 +6,7 @@
* GPL LICENSE SUMMARY
*
* Copyright(c) 2017 Intel Deutschland GmbH
+ * Copyright(c) 2018 Intel Corporation
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of version 2 of the GNU General Public License as
@@ -26,6 +27,7 @@
* BSD LICENSE
*
* Copyright(c) 2017 Intel Deutschland GmbH
+ * Copyright(c) 2018 Intel Corporation
* All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
@@ -81,7 +83,7 @@
#define ACPI_WRDS_WIFI_DATA_SIZE (ACPI_SAR_TABLE_SIZE + 2)
#define ACPI_EWRD_WIFI_DATA_SIZE ((ACPI_SAR_PROFILE_NUM - 1) * \
ACPI_SAR_TABLE_SIZE + 3)
-#define ACPI_WGDS_WIFI_DATA_SIZE 18
+#define ACPI_WGDS_WIFI_DATA_SIZE 19
#define ACPI_WRDD_WIFI_DATA_SIZE 2
#define ACPI_SPLC_WIFI_DATA_SIZE 2
diff --git a/drivers/net/wireless/intel/iwlwifi/mvm/fw.c b/drivers/net/wireless/intel/iwlwifi/mvm/fw.c
index dade206d5511..899f4a6432fb 100644
--- a/drivers/net/wireless/intel/iwlwifi/mvm/fw.c
+++ b/drivers/net/wireless/intel/iwlwifi/mvm/fw.c
@@ -893,7 +893,7 @@ static int iwl_mvm_sar_geo_init(struct iwl_mvm *mvm)
IWL_DEBUG_RADIO(mvm, "Sending GEO_TX_POWER_LIMIT\n");
BUILD_BUG_ON(ACPI_NUM_GEO_PROFILES * ACPI_WGDS_NUM_BANDS *
- ACPI_WGDS_TABLE_SIZE != ACPI_WGDS_WIFI_DATA_SIZE);
+ ACPI_WGDS_TABLE_SIZE + 1 != ACPI_WGDS_WIFI_DATA_SIZE);
BUILD_BUG_ON(ACPI_NUM_GEO_PROFILES > IWL_NUM_GEO_PROFILES);
Hi Greg,
Few stable candidates for 4.9.y for your consideration.
Cherry picked and build tested on linux-4.9.141 for
ARCH=arm/arm64 + allmodconfig.
Few fixes are applicable for 4.4.y and 3.18.y as well,
but they needed minor rebasing, so I'll submit them
along with other fixes shortly in separate threads.
Regards,
Amit Pundir
Amitkumar Karwar (3):
mwifiex: prevent register accesses after host is sleeping
mwifiex: report error to PCIe for suspend failure
mwifiex: Fix NULL pointer dereference in skb_dequeue()
Johannes Thumshirn (1):
cw1200: Don't leak memory if krealloc failes
Karthik D A (1):
mwifiex: fix p2p device doesn't find in scan problem
Subhash Jadavani (2):
scsi: ufs: fix race between clock gating and devfreq scaling work
scsi: ufshcd: release resources if probe fails
Vasanthakumar Thiagarajan (1):
ath10k: fix kernel panic due to race in accessing arvif list
Venkat Gopalakrishnan (1):
scsi: ufshcd: Fix race between clk scaling and ungate work
Yaniv Gardi (1):
scsi: ufs: fix bugs related to null pointer access and array size
drivers/net/wireless/ath/ath10k/mac.c | 6 ++
drivers/net/wireless/marvell/mwifiex/cfg80211.c | 10 +++-
drivers/net/wireless/marvell/mwifiex/pcie.c | 19 +++++--
drivers/net/wireless/marvell/mwifiex/wmm.c | 12 +++-
drivers/net/wireless/st/cw1200/wsm.c | 16 +++---
drivers/scsi/ufs/ufs.h | 3 +-
drivers/scsi/ufs/ufshcd-pci.c | 2 +
drivers/scsi/ufs/ufshcd-pltfrm.c | 5 +-
drivers/scsi/ufs/ufshcd.c | 75 ++++++++++++++++++++++---
9 files changed, 118 insertions(+), 30 deletions(-)
--
2.7.4
On Thu, 2018-11-29 at 09:12 +0000, Sasha Levin wrote:
> Hi,
>
> [This is an automated email]
>
> This commit has been processed because it contains a -stable tag.
> The stable tag indicates that it's relevant for the following trees:
> all
>
> The bot has tested the following trees: v4.19.5, v4.14.84, v4.9.141,
> v4.4.165, v3.18.127,
>
> v4.19.5: Build OK!
> v4.14.84: Build OK!
> v4.9.141: Failed to apply! Possible dependencies:
>
> v4.4.165: Failed to apply! Possible dependencies:
>
> v3.18.127: Failed to apply! Possible dependencies:
>
> How should we proceed with this patch?
I think it's fine to apply it only to 4.19 and 4.14. It's not
imperative that the older kernels get it. People building those kernels
should already have their tools in place; it's not like we expect *new*
users of ancient kernels, who will be tripped up by this.
Currently kernel might allocate different connector ids
for the same outputs in case of DP MST, which seems to
confuse userspace. There are can be different connector
ids in the list, which could be assigned to the same
output, while being in different states.
This results in issues, like external displays staying
blank after quick unplugging and plugging back(bug #106250).
Returning only active DP connectors fixes the issue.
v2: Removed caps from the title
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=106250
Signed-off-by: Stanislav Lisovskiy <stanislav.lisovskiy(a)intel.com>
---
drivers/gpu/drm/drm_mode_config.c | 16 +++++++++++-----
1 file changed, 11 insertions(+), 5 deletions(-)
diff --git a/drivers/gpu/drm/drm_mode_config.c b/drivers/gpu/drm/drm_mode_config.c
index ee80788f2c40..ec5b2b08a45e 100644
--- a/drivers/gpu/drm/drm_mode_config.c
+++ b/drivers/gpu/drm/drm_mode_config.c
@@ -143,6 +143,7 @@ int drm_mode_getresources(struct drm_device *dev, void *data,
drm_connector_list_iter_begin(dev, &conn_iter);
count = 0;
connector_id = u64_to_user_ptr(card_res->connector_id_ptr);
+ DRM_DEBUG_KMS("GetResources: writing connectors start");
drm_for_each_connector_iter(connector, &conn_iter) {
/* only expose writeback connectors if userspace understands them */
if (!file_priv->writeback_connectors &&
@@ -150,15 +151,20 @@ int drm_mode_getresources(struct drm_device *dev, void *data,
continue;
if (drm_lease_held(file_priv, connector->base.id)) {
- if (count < card_res->count_connectors &&
- put_user(connector->base.id, connector_id + count)) {
- drm_connector_list_iter_end(&conn_iter);
- return -EFAULT;
+ if (connector->connector_type != DRM_MODE_CONNECTOR_DisplayPort ||
+ connector->status != connector_status_disconnected) {
+ if (count < card_res->count_connectors &&
+ put_user(connector->base.id, connector_id + count)) {
+ drm_connector_list_iter_end(&conn_iter);
+ return -EFAULT;
+ }
+ DRM_DEBUG_KMS("GetResources: connector %s", connector->name);
+ count++;
}
- count++;
}
}
card_res->count_connectors = count;
+ DRM_DEBUG_KMS("GetResources: writing connectors end - count %d", count);
drm_connector_list_iter_end(&conn_iter);
return ret;
--
2.17.1
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 25bbe21bf427a81b8e3ccd480ea0e1d940256156 Mon Sep 17 00:00:00 2001
From: Matthew Wilcox <willy(a)infradead.org>
Date: Fri, 16 Nov 2018 15:50:02 -0500
Subject: [PATCH] dax: Avoid losing wakeup in dax_lock_mapping_entry
After calling get_unlocked_entry(), you have to call
put_unlocked_entry() to avoid subsequent waiters losing wakeups.
Fixes: c2a7d2a11552 ("filesystem-dax: Introduce dax_lock_mapping_entry()")
Cc: stable(a)vger.kernel.org
Signed-off-by: Matthew Wilcox <willy(a)infradead.org>
diff --git a/fs/dax.c b/fs/dax.c
index cf2394e2bf4b..9bcce89ea18e 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -391,6 +391,7 @@ bool dax_lock_mapping_entry(struct page *page)
rcu_read_unlock();
entry = get_unlocked_entry(&xas);
xas_unlock_irq(&xas);
+ put_unlocked_entry(&xas, entry);
rcu_read_lock();
continue;
}
From: "Cherian, George" <George.Cherian(a)cavium.com>
commit 11644a7659529730eaf2f166efaabe7c3dc7af8c upstream
Implement workaround for ThunderX2 Errata-129 (documented in
CN99XX Known Issues" available at Cavium support site).
As per ThunderX2errata-129, USB 2 device may come up as USB 1
if a connection to a USB 1 device is followed by another connection to
a USB 2 device, the link will come up as USB 1 for the USB 2 device.
Resolution: Reset the PHY after the USB 1 device is disconnected.
The PHY reset sequence is done using private registers in XHCI register
space. After the PHY is reset we check for the PLL lock status and retry
the operation if it fails. From our tests, retrying 4 times is sufficient.
Add a new quirk flag XHCI_RESET_PLL_ON_DISCONNECT to invoke the workaround
in handle_xhci_port_status().
Cc: stable(a)vger.kernel.org
Cc: stable(a)vger.kernel.org # 4.14.x: 36b6857: xhci: Allow more than 32 quirks
Signed-off-by: George Cherian <george.cherian(a)cavium.com>
Signed-off-by: Mathias Nyman <mathias.nyman(a)linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
There is a conflict while cherry-pick of 36b6857: xhci: Allow more than
32 quirks. It is trivial to resolve. Let me know in case if it is an
issue.
drivers/usb/host/xhci-pci.c | 5 +++++
drivers/usb/host/xhci-ring.c | 35 ++++++++++++++++++++++++++++++++++-
drivers/usb/host/xhci.h | 1 +
3 files changed, 40 insertions(+), 1 deletion(-)
diff --git a/drivers/usb/host/xhci-pci.c b/drivers/usb/host/xhci-pci.c
index 9218f506f8e3..4b07b6859b4c 100644
--- a/drivers/usb/host/xhci-pci.c
+++ b/drivers/usb/host/xhci-pci.c
@@ -236,6 +236,11 @@ static void xhci_pci_quirks(struct device *dev, struct xhci_hcd *xhci)
if (pdev->vendor == PCI_VENDOR_ID_TI && pdev->device == 0x8241)
xhci->quirks |= XHCI_LIMIT_ENDPOINT_INTERVAL_7;
+ if ((pdev->vendor == PCI_VENDOR_ID_BROADCOM ||
+ pdev->vendor == PCI_VENDOR_ID_CAVIUM) &&
+ pdev->device == 0x9026)
+ xhci->quirks |= XHCI_RESET_PLL_ON_DISCONNECT;
+
if (xhci->quirks & XHCI_RESET_ON_RESUME)
xhci_dbg_trace(xhci, trace_xhci_dbg_quirks,
"QUIRK: Resetting on resume");
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 6996235e34a9..ea35f346d26b 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -1568,6 +1568,35 @@ static void handle_device_notification(struct xhci_hcd *xhci,
usb_wakeup_notification(udev->parent, udev->portnum);
}
+/*
+ * Quirk hanlder for errata seen on Cavium ThunderX2 processor XHCI
+ * Controller.
+ * As per ThunderX2errata-129 USB 2 device may come up as USB 1
+ * If a connection to a USB 1 device is followed by another connection
+ * to a USB 2 device.
+ *
+ * Reset the PHY after the USB device is disconnected if device speed
+ * is less than HCD_USB3.
+ * Retry the reset sequence max of 4 times checking the PLL lock status.
+ *
+ */
+static void xhci_cavium_reset_phy_quirk(struct xhci_hcd *xhci)
+{
+ struct usb_hcd *hcd = xhci_to_hcd(xhci);
+ u32 pll_lock_check;
+ u32 retry_count = 4;
+
+ do {
+ /* Assert PHY reset */
+ writel(0x6F, hcd->regs + 0x1048);
+ udelay(10);
+ /* De-assert the PHY reset */
+ writel(0x7F, hcd->regs + 0x1048);
+ udelay(200);
+ pll_lock_check = readl(hcd->regs + 0x1070);
+ } while (!(pll_lock_check & 0x1) && --retry_count);
+}
+
static void handle_port_status(struct xhci_hcd *xhci,
union xhci_trb *event)
{
@@ -1725,9 +1754,13 @@ static void handle_port_status(struct xhci_hcd *xhci,
goto cleanup;
}
- if (hcd->speed < HCD_USB3)
+ if (hcd->speed < HCD_USB3) {
xhci_test_and_clear_bit(xhci, port_array, faked_port_index,
PORT_PLC);
+ if ((xhci->quirks & XHCI_RESET_PLL_ON_DISCONNECT) &&
+ (portsc & PORT_CSC) && !(portsc & PORT_CONNECT))
+ xhci_cavium_reset_phy_quirk(xhci);
+ }
cleanup:
/* Update event ring dequeue pointer before dropping the lock */
diff --git a/drivers/usb/host/xhci.h b/drivers/usb/host/xhci.h
index d7d2a3dfafb8..84457fc192fc 100644
--- a/drivers/usb/host/xhci.h
+++ b/drivers/usb/host/xhci.h
@@ -1836,6 +1836,7 @@ struct xhci_hcd {
#define XHCI_U2_DISABLE_WAKE BIT_ULL(27)
#define XHCI_ASMEDIA_MODIFY_FLOWCONTROL BIT_ULL(28)
#define XHCI_SUSPEND_DELAY BIT_ULL(30)
+#define XHCI_RESET_PLL_ON_DISCONNECT BIT_ULL(34)
unsigned int num_active_eps;
unsigned int limit_active_eps;
--
2.19.2
The patch titled
Subject: mm/khugepaged: collapse_shmem() do not crash on Compound
has been added to the -mm tree. Its filename is
mm-khugepaged-collapse_shmem-do-not-crash-on-compound.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/mm-khugepaged-collapse_shmem-do-no…
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/mm-khugepaged-collapse_shmem-do-no…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Hugh Dickins <hughd(a)google.com>
Subject: mm/khugepaged: collapse_shmem() do not crash on Compound
collapse_shmem()'s VM_BUG_ON_PAGE(PageTransCompound) was unsafe: before it
holds page lock of the first page, racing truncation then extension might
conceivably have inserted a hugepage there already. Fail with the
SCAN_PAGE_COMPOUND result, instead of crashing (CONFIG_DEBUG_VM=y) or
otherwise mishandling the unexpected hugepage - though later we might code
up a more constructive way of handling it, with SCAN_SUCCESS.
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261529310.2275@eggly.anvils
Fixes: f3f0e1d2150b2 ("khugepaged: add support of collapse for tmpfs/shmem pages")
Signed-off-by: Hugh Dickins <hughd(a)google.com>
Cc: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: Jerome Glisse <jglisse(a)redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov(a)yandex-team.ru>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: <stable(a)vger.kernel.org> [4.8+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
--- a/mm/khugepaged.c~mm-khugepaged-collapse_shmem-do-not-crash-on-compound
+++ a/mm/khugepaged.c
@@ -1399,7 +1399,15 @@ static void collapse_shmem(struct mm_str
*/
VM_BUG_ON_PAGE(!PageLocked(page), page);
VM_BUG_ON_PAGE(!PageUptodate(page), page);
- VM_BUG_ON_PAGE(PageTransCompound(page), page);
+
+ /*
+ * If file was truncated then extended, or hole-punched, before
+ * we locked the first page, then a THP might be there already.
+ */
+ if (PageTransCompound(page)) {
+ result = SCAN_PAGE_COMPOUND;
+ goto out_unlock;
+ }
if (page_mapping(page) != mapping) {
result = SCAN_TRUNCATED;
_
Patches currently in -mm which might be from hughd(a)google.com are
mm-huge_memory-rename-freeze_page-to-unmap_page.patch
mm-huge_memory-splitting-set-mappingindex-before-unfreeze.patch
mm-huge_memory-fix-lockdep-complaint-on-32-bit-i_size_read.patch
mm-khugepaged-collapse_shmem-stop-if-punched-or-truncated.patch
mm-khugepaged-fix-crashes-due-to-misaccounted-holes.patch
mm-khugepaged-collapse_shmem-remember-to-clear-holes.patch
mm-khugepaged-minor-reorderings-in-collapse_shmem.patch
mm-khugepaged-collapse_shmem-without-freezing-new_page.patch
mm-khugepaged-collapse_shmem-do-not-crash-on-compound.patch
mm-khugepaged-fix-the-xas_create_range-error-path.patch
mm-put_and_wait_on_page_locked-while-page-is-migrated.patch
The patch titled
Subject: mm/khugepaged: collapse_shmem() without freezing new_page
has been added to the -mm tree. Its filename is
mm-khugepaged-collapse_shmem-without-freezing-new_page.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/mm-khugepaged-collapse_shmem-witho…
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/mm-khugepaged-collapse_shmem-witho…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Hugh Dickins <hughd(a)google.com>
Subject: mm/khugepaged: collapse_shmem() without freezing new_page
khugepaged's collapse_shmem() does almost all of its work, to assemble the
huge new_page from 512 scattered old pages, with the new_page's refcount
frozen to 0 (and refcounts of all old pages so far also frozen to 0).
Including shmem_getpage() to read in any which were out on swap, memory
reclaim if necessary to allocate their intermediate pages, and copying
over all the data from old to new.
Imagine the frozen refcount as a spinlock held, but without any lock
debugging to highlight the abuse: it's not good, and under serious load
heads into lockups - speculative getters of the page are not expecting to
spin while khugepaged is rescheduled.
One can get a little further under load by hacking around elsewhere; but
fortunately, freezing the new_page turns out to have been entirely
unnecessary, with no hacks needed elsewhere.
The huge new_page lock is already held throughout, and guards all its
subpages as they are brought one by one into the page cache tree; and
anything reading the data in that page, without the lock, before it has
been marked PageUptodate, would already be in the wrong. So simply
eliminate the freezing of the new_page.
Each of the old pages remains frozen with refcount 0 after it has been
replaced by a new_page subpage in the page cache tree, until they are all
unfrozen on success or failure: just as before. They could be unfrozen
sooner, but cause no problem once no longer visible to find_get_entry(),
filemap_map_pages() and other speculative lookups.
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261527570.2275@eggly.anvils
Fixes: f3f0e1d2150b2 ("khugepaged: add support of collapse for tmpfs/shmem pages")
Signed-off-by: Hugh Dickins <hughd(a)google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: Jerome Glisse <jglisse(a)redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov(a)yandex-team.ru>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: <stable(a)vger.kernel.org> [4.8+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
--- a/mm/khugepaged.c~mm-khugepaged-collapse_shmem-without-freezing-new_page
+++ a/mm/khugepaged.c
@@ -1287,7 +1287,7 @@ static void retract_page_tables(struct a
* collapse_shmem - collapse small tmpfs/shmem pages into huge one.
*
* Basic scheme is simple, details are more complex:
- * - allocate and freeze a new huge page;
+ * - allocate and lock a new huge page;
* - scan page cache replacing old pages with the new one
* + swap in pages if necessary;
* + fill in gaps;
@@ -1295,11 +1295,11 @@ static void retract_page_tables(struct a
* - if replacing succeeds:
* + copy data over;
* + free old pages;
- * + unfreeze huge page;
+ * + unlock huge page;
* - if replacing failed;
* + put all pages back and unfreeze them;
* + restore gaps in the page cache;
- * + free huge page;
+ * + unlock and free huge page;
*/
static void collapse_shmem(struct mm_struct *mm,
struct address_space *mapping, pgoff_t start,
@@ -1333,13 +1333,11 @@ static void collapse_shmem(struct mm_str
__SetPageSwapBacked(new_page);
new_page->index = start;
new_page->mapping = mapping;
- BUG_ON(!page_ref_freeze(new_page, 1));
/*
- * At this point the new_page is 'frozen' (page_count() is zero),
- * locked and not up-to-date. It's safe to insert it into the page
- * cache, because nobody would be able to map it or use it in other
- * way until we unfreeze it.
+ * At this point the new_page is locked and not up-to-date.
+ * It's safe to insert it into the page cache, because nobody would
+ * be able to map it or use it in another way until we unlock it.
*/
/* This will be less messy when we use multi-index entries */
@@ -1491,9 +1489,8 @@ xa_unlocked:
index++;
}
- /* Everything is ready, let's unfreeze the new_page */
SetPageUptodate(new_page);
- page_ref_unfreeze(new_page, HPAGE_PMD_NR);
+ page_ref_add(new_page, HPAGE_PMD_NR - 1);
set_page_dirty(new_page);
mem_cgroup_commit_charge(new_page, memcg, false, true);
lru_cache_add_anon(new_page);
@@ -1541,8 +1538,6 @@ xa_unlocked:
VM_BUG_ON(nr_none);
xas_unlock_irq(&xas);
- /* Unfreeze new_page, caller would take care about freeing it */
- page_ref_unfreeze(new_page, 1);
mem_cgroup_cancel_charge(new_page, memcg, true);
new_page->mapping = NULL;
}
_
Patches currently in -mm which might be from hughd(a)google.com are
mm-huge_memory-rename-freeze_page-to-unmap_page.patch
mm-huge_memory-splitting-set-mappingindex-before-unfreeze.patch
mm-huge_memory-fix-lockdep-complaint-on-32-bit-i_size_read.patch
mm-khugepaged-collapse_shmem-stop-if-punched-or-truncated.patch
mm-khugepaged-fix-crashes-due-to-misaccounted-holes.patch
mm-khugepaged-collapse_shmem-remember-to-clear-holes.patch
mm-khugepaged-minor-reorderings-in-collapse_shmem.patch
mm-khugepaged-collapse_shmem-without-freezing-new_page.patch
mm-khugepaged-collapse_shmem-do-not-crash-on-compound.patch
mm-khugepaged-fix-the-xas_create_range-error-path.patch
mm-put_and_wait_on_page_locked-while-page-is-migrated.patch
The patch titled
Subject: mm/khugepaged: minor reorderings in collapse_shmem()
has been added to the -mm tree. Its filename is
mm-khugepaged-minor-reorderings-in-collapse_shmem.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/mm-khugepaged-minor-reorderings-in…
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/mm-khugepaged-minor-reorderings-in…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Hugh Dickins <hughd(a)google.com>
Subject: mm/khugepaged: minor reorderings in collapse_shmem()
Several cleanups in collapse_shmem(): most of which probably do not really
matter, beyond doing things in a more familiar and reassuring order.
Simplify the failure gotos in the main loop, and on success update stats
while interrupts still disabled from the last iteration.
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261526400.2275@eggly.anvils
Fixes: f3f0e1d2150b2 ("khugepaged: add support of collapse for tmpfs/shmem pages")
Signed-off-by: Hugh Dickins <hughd(a)google.com>
Cc: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: Jerome Glisse <jglisse(a)redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov(a)yandex-team.ru>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: <stable(a)vger.kernel.org> [4.8+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
--- a/mm/khugepaged.c~mm-khugepaged-minor-reorderings-in-collapse_shmem
+++ a/mm/khugepaged.c
@@ -1329,10 +1329,10 @@ static void collapse_shmem(struct mm_str
goto out;
}
+ __SetPageLocked(new_page);
+ __SetPageSwapBacked(new_page);
new_page->index = start;
new_page->mapping = mapping;
- __SetPageSwapBacked(new_page);
- __SetPageLocked(new_page);
BUG_ON(!page_ref_freeze(new_page, 1));
/*
@@ -1366,13 +1366,13 @@ static void collapse_shmem(struct mm_str
if (index == start) {
if (!xas_next_entry(&xas, end - 1)) {
result = SCAN_TRUNCATED;
- break;
+ goto xa_locked;
}
xas_set(&xas, index);
}
if (!shmem_charge(mapping->host, 1)) {
result = SCAN_FAIL;
- break;
+ goto xa_locked;
}
xas_store(&xas, new_page + (index % HPAGE_PMD_NR));
nr_none++;
@@ -1387,13 +1387,12 @@ static void collapse_shmem(struct mm_str
result = SCAN_FAIL;
goto xa_unlocked;
}
- xas_lock_irq(&xas);
- xas_set(&xas, index);
} else if (trylock_page(page)) {
get_page(page);
+ xas_unlock_irq(&xas);
} else {
result = SCAN_PAGE_LOCK;
- break;
+ goto xa_locked;
}
/*
@@ -1408,11 +1407,10 @@ static void collapse_shmem(struct mm_str
result = SCAN_TRUNCATED;
goto out_unlock;
}
- xas_unlock_irq(&xas);
if (isolate_lru_page(page)) {
result = SCAN_DEL_PAGE_LRU;
- goto out_isolate_failed;
+ goto out_unlock;
}
if (page_mapped(page))
@@ -1432,7 +1430,9 @@ static void collapse_shmem(struct mm_str
*/
if (!page_ref_freeze(page, 3)) {
result = SCAN_PAGE_COUNT;
- goto out_lru;
+ xas_unlock_irq(&xas);
+ putback_lru_page(page);
+ goto out_unlock;
}
/*
@@ -1444,24 +1444,26 @@ static void collapse_shmem(struct mm_str
/* Finally, replace with the new page. */
xas_store(&xas, new_page + (index % HPAGE_PMD_NR));
continue;
-out_lru:
- xas_unlock_irq(&xas);
- putback_lru_page(page);
-out_isolate_failed:
- unlock_page(page);
- put_page(page);
- goto xa_unlocked;
out_unlock:
unlock_page(page);
put_page(page);
- break;
+ goto xa_unlocked;
+ }
+
+ __inc_node_page_state(new_page, NR_SHMEM_THPS);
+ if (nr_none) {
+ struct zone *zone = page_zone(new_page);
+
+ __mod_node_page_state(zone->zone_pgdat, NR_FILE_PAGES, nr_none);
+ __mod_node_page_state(zone->zone_pgdat, NR_SHMEM, nr_none);
}
- xas_unlock_irq(&xas);
+xa_locked:
+ xas_unlock_irq(&xas);
xa_unlocked:
+
if (result == SCAN_SUCCEED) {
struct page *page, *tmp;
- struct zone *zone = page_zone(new_page);
/*
* Replacing old pages with new one has succeeded, now we
@@ -1476,11 +1478,11 @@ xa_unlocked:
copy_highpage(new_page + (page->index % HPAGE_PMD_NR),
page);
list_del(&page->lru);
- unlock_page(page);
- page_ref_unfreeze(page, 1);
page->mapping = NULL;
+ page_ref_unfreeze(page, 1);
ClearPageActive(page);
ClearPageUnevictable(page);
+ unlock_page(page);
put_page(page);
index++;
}
@@ -1489,28 +1491,17 @@ xa_unlocked:
index++;
}
- local_irq_disable();
- __inc_node_page_state(new_page, NR_SHMEM_THPS);
- if (nr_none) {
- __mod_node_page_state(zone->zone_pgdat, NR_FILE_PAGES, nr_none);
- __mod_node_page_state(zone->zone_pgdat, NR_SHMEM, nr_none);
- }
- local_irq_enable();
-
- /*
- * Remove pte page tables, so we can re-fault
- * the page as huge.
- */
- retract_page_tables(mapping, start);
-
/* Everything is ready, let's unfreeze the new_page */
- set_page_dirty(new_page);
SetPageUptodate(new_page);
page_ref_unfreeze(new_page, HPAGE_PMD_NR);
+ set_page_dirty(new_page);
mem_cgroup_commit_charge(new_page, memcg, false, true);
lru_cache_add_anon(new_page);
- unlock_page(new_page);
+ /*
+ * Remove pte page tables, so we can re-fault the page as huge.
+ */
+ retract_page_tables(mapping, start);
*hpage = NULL;
khugepaged_pages_collapsed++;
@@ -1543,8 +1534,8 @@ xa_unlocked:
xas_store(&xas, page);
xas_pause(&xas);
xas_unlock_irq(&xas);
- putback_lru_page(page);
unlock_page(page);
+ putback_lru_page(page);
xas_lock_irq(&xas);
}
VM_BUG_ON(nr_none);
@@ -1553,9 +1544,10 @@ xa_unlocked:
/* Unfreeze new_page, caller would take care about freeing it */
page_ref_unfreeze(new_page, 1);
mem_cgroup_cancel_charge(new_page, memcg, true);
- unlock_page(new_page);
new_page->mapping = NULL;
}
+
+ unlock_page(new_page);
out:
VM_BUG_ON(!list_empty(&pagelist));
/* TODO: tracepoints */
_
Patches currently in -mm which might be from hughd(a)google.com are
mm-huge_memory-rename-freeze_page-to-unmap_page.patch
mm-huge_memory-splitting-set-mappingindex-before-unfreeze.patch
mm-huge_memory-fix-lockdep-complaint-on-32-bit-i_size_read.patch
mm-khugepaged-collapse_shmem-stop-if-punched-or-truncated.patch
mm-khugepaged-fix-crashes-due-to-misaccounted-holes.patch
mm-khugepaged-collapse_shmem-remember-to-clear-holes.patch
mm-khugepaged-minor-reorderings-in-collapse_shmem.patch
mm-khugepaged-collapse_shmem-without-freezing-new_page.patch
mm-khugepaged-collapse_shmem-do-not-crash-on-compound.patch
mm-khugepaged-fix-the-xas_create_range-error-path.patch
mm-put_and_wait_on_page_locked-while-page-is-migrated.patch
The patch titled
Subject: mm/khugepaged: collapse_shmem() remember to clear holes
has been added to the -mm tree. Its filename is
mm-khugepaged-collapse_shmem-remember-to-clear-holes.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/mm-khugepaged-collapse_shmem-remem…
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/mm-khugepaged-collapse_shmem-remem…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Hugh Dickins <hughd(a)google.com>
Subject: mm/khugepaged: collapse_shmem() remember to clear holes
Huge tmpfs testing reminds us that there is no __GFP_ZERO in the gfp flags
khugepaged uses to allocate a huge page - in all common cases it would
just be a waste of effort - so collapse_shmem() must remember to clear out
any holes that it instantiates.
The obvious place to do so, where they are put into the page cache tree,
is not a good choice: because interrupts are disabled there. Leave it
until further down, once success is assured, where the other pages are
copied (before setting PageUptodate).
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261525080.2275@eggly.anvils
Fixes: f3f0e1d2150b2 ("khugepaged: add support of collapse for tmpfs/shmem pages")
Signed-off-by: Hugh Dickins <hughd(a)google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: Jerome Glisse <jglisse(a)redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov(a)yandex-team.ru>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: <stable(a)vger.kernel.org> [4.8+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
--- a/mm/khugepaged.c~mm-khugepaged-collapse_shmem-remember-to-clear-holes
+++ a/mm/khugepaged.c
@@ -1467,7 +1467,12 @@ xa_unlocked:
* Replacing old pages with new one has succeeded, now we
* need to copy the content and free the old pages.
*/
+ index = start;
list_for_each_entry_safe(page, tmp, &pagelist, lru) {
+ while (index < page->index) {
+ clear_highpage(new_page + (index % HPAGE_PMD_NR));
+ index++;
+ }
copy_highpage(new_page + (page->index % HPAGE_PMD_NR),
page);
list_del(&page->lru);
@@ -1477,6 +1482,11 @@ xa_unlocked:
ClearPageActive(page);
ClearPageUnevictable(page);
put_page(page);
+ index++;
+ }
+ while (index < end) {
+ clear_highpage(new_page + (index % HPAGE_PMD_NR));
+ index++;
}
local_irq_disable();
_
Patches currently in -mm which might be from hughd(a)google.com are
mm-huge_memory-rename-freeze_page-to-unmap_page.patch
mm-huge_memory-splitting-set-mappingindex-before-unfreeze.patch
mm-huge_memory-fix-lockdep-complaint-on-32-bit-i_size_read.patch
mm-khugepaged-collapse_shmem-stop-if-punched-or-truncated.patch
mm-khugepaged-fix-crashes-due-to-misaccounted-holes.patch
mm-khugepaged-collapse_shmem-remember-to-clear-holes.patch
mm-khugepaged-minor-reorderings-in-collapse_shmem.patch
mm-khugepaged-collapse_shmem-without-freezing-new_page.patch
mm-khugepaged-collapse_shmem-do-not-crash-on-compound.patch
mm-khugepaged-fix-the-xas_create_range-error-path.patch
mm-put_and_wait_on_page_locked-while-page-is-migrated.patch
The patch titled
Subject: mm/khugepaged: fix crashes due to misaccounted holes
has been added to the -mm tree. Its filename is
mm-khugepaged-fix-crashes-due-to-misaccounted-holes.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/mm-khugepaged-fix-crashes-due-to-m…
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/mm-khugepaged-fix-crashes-due-to-m…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Hugh Dickins <hughd(a)google.com>
Subject: mm/khugepaged: fix crashes due to misaccounted holes
Huge tmpfs testing on a shortish file mapped into a pmd-rounded extent hit
shmem_evict_inode()'s WARN_ON(inode->i_blocks) followed by clear_inode()'s
BUG_ON(inode->i_data.nrpages) when the file was later closed and unlinked.
khugepaged's collapse_shmem() was forgetting to update mapping->nrpages on
the rollback path, after it had added but then needs to undo some holes.
There is indeed an irritating asymmetry between shmem_charge(), whose
callers want it to increment nrpages after successfully accounting blocks,
and shmem_uncharge(), when __delete_from_page_cache() already decremented
nrpages itself: oh well, just add a comment on that to them both.
And shmem_recalc_inode() is supposed to be called when the accounting is
expected to be in balance (so it can deduce from imbalance that reclaim
discarded some pages): so change shmem_charge() to update nrpages earlier
(though it's rare for the difference to matter at all).
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261523450.2275@eggly.anvils
Fixes: 800d8c63b2e98 ("shmem: add huge pages support")
Fixes: f3f0e1d2150b2 ("khugepaged: add support of collapse for tmpfs/shmem pages")
Signed-off-by: Hugh Dickins <hughd(a)google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: Jerome Glisse <jglisse(a)redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov(a)yandex-team.ru>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: <stable(a)vger.kernel.org> [4.8+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
--- a/mm/khugepaged.c~mm-khugepaged-fix-crashes-due-to-misaccounted-holes
+++ a/mm/khugepaged.c
@@ -1506,9 +1506,12 @@ xa_unlocked:
khugepaged_pages_collapsed++;
} else {
struct page *page;
+
/* Something went wrong: roll back page cache changes */
- shmem_uncharge(mapping->host, nr_none);
xas_lock_irq(&xas);
+ mapping->nrpages -= nr_none;
+ shmem_uncharge(mapping->host, nr_none);
+
xas_set(&xas, start);
xas_for_each(&xas, page, end - 1) {
page = list_first_entry_or_null(&pagelist,
--- a/mm/shmem.c~mm-khugepaged-fix-crashes-due-to-misaccounted-holes
+++ a/mm/shmem.c
@@ -297,12 +297,14 @@ bool shmem_charge(struct inode *inode, l
if (!shmem_inode_acct_block(inode, pages))
return false;
+ /* nrpages adjustment first, then shmem_recalc_inode() when balanced */
+ inode->i_mapping->nrpages += pages;
+
spin_lock_irqsave(&info->lock, flags);
info->alloced += pages;
inode->i_blocks += pages * BLOCKS_PER_PAGE;
shmem_recalc_inode(inode);
spin_unlock_irqrestore(&info->lock, flags);
- inode->i_mapping->nrpages += pages;
return true;
}
@@ -312,6 +314,8 @@ void shmem_uncharge(struct inode *inode,
struct shmem_inode_info *info = SHMEM_I(inode);
unsigned long flags;
+ /* nrpages adjustment done by __delete_from_page_cache() or caller */
+
spin_lock_irqsave(&info->lock, flags);
info->alloced -= pages;
inode->i_blocks -= pages * BLOCKS_PER_PAGE;
_
Patches currently in -mm which might be from hughd(a)google.com are
mm-huge_memory-rename-freeze_page-to-unmap_page.patch
mm-huge_memory-splitting-set-mappingindex-before-unfreeze.patch
mm-huge_memory-fix-lockdep-complaint-on-32-bit-i_size_read.patch
mm-khugepaged-collapse_shmem-stop-if-punched-or-truncated.patch
mm-khugepaged-fix-crashes-due-to-misaccounted-holes.patch
mm-khugepaged-collapse_shmem-remember-to-clear-holes.patch
mm-khugepaged-minor-reorderings-in-collapse_shmem.patch
mm-khugepaged-collapse_shmem-without-freezing-new_page.patch
mm-khugepaged-collapse_shmem-do-not-crash-on-compound.patch
mm-khugepaged-fix-the-xas_create_range-error-path.patch
mm-put_and_wait_on_page_locked-while-page-is-migrated.patch
The patch titled
Subject: mm/khugepaged: collapse_shmem() stop if punched or truncated
has been added to the -mm tree. Its filename is
mm-khugepaged-collapse_shmem-stop-if-punched-or-truncated.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/mm-khugepaged-collapse_shmem-stop-…
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/mm-khugepaged-collapse_shmem-stop-…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Hugh Dickins <hughd(a)google.com>
Subject: mm/khugepaged: collapse_shmem() stop if punched or truncated
Huge tmpfs testing showed that although collapse_shmem() recognizes a
concurrently truncated or hole-punched page correctly, its handling of
holes was liable to refill an emptied extent. Add check to stop that.
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261522040.2275@eggly.anvils
Fixes: f3f0e1d2150b2 ("khugepaged: add support of collapse for tmpfs/shmem pages")
Signed-off-by: Hugh Dickins <hughd(a)google.com>
Reviewed-by: Matthew Wilcox <willy(a)infradead.org>
Cc: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: Jerome Glisse <jglisse(a)redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov(a)yandex-team.ru>
Cc: <stable(a)vger.kernel.org> [4.8+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
--- a/mm/khugepaged.c~mm-khugepaged-collapse_shmem-stop-if-punched-or-truncated
+++ a/mm/khugepaged.c
@@ -1359,6 +1359,17 @@ static void collapse_shmem(struct mm_str
VM_BUG_ON(index != xas.xa_index);
if (!page) {
+ /*
+ * Stop if extent has been truncated or hole-punched,
+ * and is now completely empty.
+ */
+ if (index == start) {
+ if (!xas_next_entry(&xas, end - 1)) {
+ result = SCAN_TRUNCATED;
+ break;
+ }
+ xas_set(&xas, index);
+ }
if (!shmem_charge(mapping->host, 1)) {
result = SCAN_FAIL;
break;
_
Patches currently in -mm which might be from hughd(a)google.com are
mm-huge_memory-rename-freeze_page-to-unmap_page.patch
mm-huge_memory-splitting-set-mappingindex-before-unfreeze.patch
mm-huge_memory-fix-lockdep-complaint-on-32-bit-i_size_read.patch
mm-khugepaged-collapse_shmem-stop-if-punched-or-truncated.patch
mm-khugepaged-fix-crashes-due-to-misaccounted-holes.patch
mm-khugepaged-collapse_shmem-remember-to-clear-holes.patch
mm-khugepaged-minor-reorderings-in-collapse_shmem.patch
mm-khugepaged-collapse_shmem-without-freezing-new_page.patch
mm-khugepaged-collapse_shmem-do-not-crash-on-compound.patch
mm-khugepaged-fix-the-xas_create_range-error-path.patch
mm-put_and_wait_on_page_locked-while-page-is-migrated.patch
The patch titled
Subject: mm/huge_memory: fix lockdep complaint on 32-bit i_size_read()
has been added to the -mm tree. Its filename is
mm-huge_memory-fix-lockdep-complaint-on-32-bit-i_size_read.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/mm-huge_memory-fix-lockdep-complai…
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/mm-huge_memory-fix-lockdep-complai…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Hugh Dickins <hughd(a)google.com>
Subject: mm/huge_memory: fix lockdep complaint on 32-bit i_size_read()
Huge tmpfs testing, on 32-bit kernel with lockdep enabled, showed that
__split_huge_page() was using i_size_read() while holding the irq-safe
lru_lock and page tree lock, but the 32-bit i_size_read() uses an
irq-unsafe seqlock which should not be nested inside them.
Instead, read the i_size earlier in split_huge_page_to_list(), and pass
the end offset down to __split_huge_page(): all while holding head page
lock, which is enough to prevent truncation of that extent before the page
tree lock has been taken.
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261520070.2275@eggly.anvils
Fixes: baa355fd33142 ("thp: file pages support for split_huge_page()")
Signed-off-by: Hugh Dickins <hughd(a)google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: Jerome Glisse <jglisse(a)redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov(a)yandex-team.ru>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: <stable(a)vger.kernel.org> [4.8+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
--- a/mm/huge_memory.c~mm-huge_memory-fix-lockdep-complaint-on-32-bit-i_size_read
+++ a/mm/huge_memory.c
@@ -2439,12 +2439,11 @@ static void __split_huge_page_tail(struc
}
static void __split_huge_page(struct page *page, struct list_head *list,
- unsigned long flags)
+ pgoff_t end, unsigned long flags)
{
struct page *head = compound_head(page);
struct zone *zone = page_zone(head);
struct lruvec *lruvec;
- pgoff_t end = -1;
int i;
lruvec = mem_cgroup_page_lruvec(head, zone->zone_pgdat);
@@ -2452,9 +2451,6 @@ static void __split_huge_page(struct pag
/* complete memcg works before add pages to LRU */
mem_cgroup_split_huge_fixup(head);
- if (!PageAnon(page))
- end = DIV_ROUND_UP(i_size_read(head->mapping->host), PAGE_SIZE);
-
for (i = HPAGE_PMD_NR - 1; i >= 1; i--) {
__split_huge_page_tail(head, i, lruvec, list);
/* Some pages can be beyond i_size: drop them from page cache */
@@ -2626,6 +2622,7 @@ int split_huge_page_to_list(struct page
int count, mapcount, extra_pins, ret;
bool mlocked;
unsigned long flags;
+ pgoff_t end;
VM_BUG_ON_PAGE(is_huge_zero_page(page), page);
VM_BUG_ON_PAGE(!PageLocked(page), page);
@@ -2648,6 +2645,7 @@ int split_huge_page_to_list(struct page
ret = -EBUSY;
goto out;
}
+ end = -1;
mapping = NULL;
anon_vma_lock_write(anon_vma);
} else {
@@ -2661,6 +2659,15 @@ int split_huge_page_to_list(struct page
anon_vma = NULL;
i_mmap_lock_read(mapping);
+
+ /*
+ *__split_huge_page() may need to trim off pages beyond EOF:
+ * but on 32-bit, i_size_read() takes an irq-unsafe seqlock,
+ * which cannot be nested inside the page tree lock. So note
+ * end now: i_size itself may be changed at any moment, but
+ * head page lock is good enough to serialize the trimming.
+ */
+ end = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE);
}
/*
@@ -2707,7 +2714,7 @@ int split_huge_page_to_list(struct page
if (mapping)
__dec_node_page_state(page, NR_SHMEM_THPS);
spin_unlock(&pgdata->split_queue_lock);
- __split_huge_page(page, list, flags);
+ __split_huge_page(page, list, end, flags);
if (PageSwapCache(head)) {
swp_entry_t entry = { .val = page_private(head) };
_
Patches currently in -mm which might be from hughd(a)google.com are
mm-huge_memory-rename-freeze_page-to-unmap_page.patch
mm-huge_memory-splitting-set-mappingindex-before-unfreeze.patch
mm-huge_memory-fix-lockdep-complaint-on-32-bit-i_size_read.patch
mm-khugepaged-collapse_shmem-stop-if-punched-or-truncated.patch
mm-khugepaged-fix-crashes-due-to-misaccounted-holes.patch
mm-khugepaged-collapse_shmem-remember-to-clear-holes.patch
mm-khugepaged-minor-reorderings-in-collapse_shmem.patch
mm-khugepaged-collapse_shmem-without-freezing-new_page.patch
mm-khugepaged-collapse_shmem-do-not-crash-on-compound.patch
mm-khugepaged-fix-the-xas_create_range-error-path.patch
mm-put_and_wait_on_page_locked-while-page-is-migrated.patch
The patch titled
Subject: mm/huge_memory: splitting set mapping+index before unfreeze
has been added to the -mm tree. Its filename is
mm-huge_memory-splitting-set-mappingindex-before-unfreeze.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/mm-huge_memory-splitting-set-mappi…
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/mm-huge_memory-splitting-set-mappi…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Hugh Dickins <hughd(a)google.com>
Subject: mm/huge_memory: splitting set mapping+index before unfreeze
Huge tmpfs stress testing has occasionally hit shmem_undo_range()'s
VM_BUG_ON_PAGE(page_to_pgoff(page) != index, page).
Move the setting of mapping and index up before the page_ref_unfreeze() in
__split_huge_page_tail() to fix this: so that a page cache lookup cannot
get a reference while the tail's mapping and index are unstable.
In fact, might as well move them up before the smp_wmb(): I don't see an
actual need for that, but if I'm missing something, this way round is
safer than the other, and no less efficient.
You might argue that VM_BUG_ON_PAGE(page_to_pgoff(page) != index, page) is
misplaced, and should be left until after the trylock_page(); but left as
is has not crashed since, and gives more stringent assurance.
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261516380.2275@eggly.anvils
Fixes: e9b61f19858a5 ("thp: reintroduce split_huge_page()")
Requires: 605ca5ede764 ("mm/huge_memory.c: reorder operations in __split_huge_page_tail()")
Signed-off-by: Hugh Dickins <hughd(a)google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: Konstantin Khlebnikov <khlebnikov(a)yandex-team.ru>
Cc: Jerome Glisse <jglisse(a)redhat.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: <stable(a)vger.kernel.org> [4.8+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
--- a/mm/huge_memory.c~mm-huge_memory-splitting-set-mappingindex-before-unfreeze
+++ a/mm/huge_memory.c
@@ -2402,6 +2402,12 @@ static void __split_huge_page_tail(struc
(1L << PG_unevictable) |
(1L << PG_dirty)));
+ /* ->mapping in first tail page is compound_mapcount */
+ VM_BUG_ON_PAGE(tail > 2 && page_tail->mapping != TAIL_MAPPING,
+ page_tail);
+ page_tail->mapping = head->mapping;
+ page_tail->index = head->index + tail;
+
/* Page flags must be visible before we make the page non-compound. */
smp_wmb();
@@ -2422,12 +2428,6 @@ static void __split_huge_page_tail(struc
if (page_is_idle(head))
set_page_idle(page_tail);
- /* ->mapping in first tail page is compound_mapcount */
- VM_BUG_ON_PAGE(tail > 2 && page_tail->mapping != TAIL_MAPPING,
- page_tail);
- page_tail->mapping = head->mapping;
-
- page_tail->index = head->index + tail;
page_cpupid_xchg_last(page_tail, page_cpupid_last(head));
/*
_
Patches currently in -mm which might be from hughd(a)google.com are
mm-huge_memory-rename-freeze_page-to-unmap_page.patch
mm-huge_memory-splitting-set-mappingindex-before-unfreeze.patch
mm-huge_memory-fix-lockdep-complaint-on-32-bit-i_size_read.patch
mm-khugepaged-collapse_shmem-stop-if-punched-or-truncated.patch
mm-khugepaged-fix-crashes-due-to-misaccounted-holes.patch
mm-khugepaged-collapse_shmem-remember-to-clear-holes.patch
mm-khugepaged-minor-reorderings-in-collapse_shmem.patch
mm-khugepaged-collapse_shmem-without-freezing-new_page.patch
mm-khugepaged-collapse_shmem-do-not-crash-on-compound.patch
mm-khugepaged-fix-the-xas_create_range-error-path.patch
mm-put_and_wait_on_page_locked-while-page-is-migrated.patch
The patch titled
Subject: mm/huge_memory: rename freeze_page() to unmap_page()
has been added to the -mm tree. Its filename is
mm-huge_memory-rename-freeze_page-to-unmap_page.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/mm-huge_memory-rename-freeze_page-…
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/mm-huge_memory-rename-freeze_page-…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Hugh Dickins <hughd(a)google.com>
Subject: mm/huge_memory: rename freeze_page() to unmap_page()
The term "freeze" is used in several ways in the kernel, and in mm it has
the particular meaning of forcing page refcount temporarily to 0.
freeze_page() is just too confusing a name for a function that unmaps a
page: rename it unmap_page(), and rename unfreeze_page() remap_page().
Went to change the mention of freeze_page() added later in mm/rmap.c, but
found it to be incorrect: ordinary page reclaim reaches there too; but the
substance of the comment still seems correct, so edit it down.
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261514080.2275@eggly.anvils
Fixes: e9b61f19858a5 ("thp: reintroduce split_huge_page()")
Signed-off-by: Hugh Dickins <hughd(a)google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: Jerome Glisse <jglisse(a)redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov(a)yandex-team.ru>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: <stable(a)vger.kernel.org> [4.8+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
--- a/mm/huge_memory.c~mm-huge_memory-rename-freeze_page-to-unmap_page
+++ a/mm/huge_memory.c
@@ -2350,7 +2350,7 @@ void vma_adjust_trans_huge(struct vm_are
}
}
-static void freeze_page(struct page *page)
+static void unmap_page(struct page *page)
{
enum ttu_flags ttu_flags = TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS |
TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD;
@@ -2365,7 +2365,7 @@ static void freeze_page(struct page *pag
VM_BUG_ON_PAGE(!unmap_success, page);
}
-static void unfreeze_page(struct page *page)
+static void remap_page(struct page *page)
{
int i;
if (PageTransHuge(page)) {
@@ -2483,7 +2483,7 @@ static void __split_huge_page(struct pag
spin_unlock_irqrestore(zone_lru_lock(page_zone(head)), flags);
- unfreeze_page(head);
+ remap_page(head);
for (i = 0; i < HPAGE_PMD_NR; i++) {
struct page *subpage = head + i;
@@ -2664,7 +2664,7 @@ int split_huge_page_to_list(struct page
}
/*
- * Racy check if we can split the page, before freeze_page() will
+ * Racy check if we can split the page, before unmap_page() will
* split PMDs
*/
if (!can_split_huge_page(head, &extra_pins)) {
@@ -2673,7 +2673,7 @@ int split_huge_page_to_list(struct page
}
mlocked = PageMlocked(page);
- freeze_page(head);
+ unmap_page(head);
VM_BUG_ON_PAGE(compound_mapcount(head), head);
/* Make sure the page is not on per-CPU pagevec as it takes pin */
@@ -2727,7 +2727,7 @@ int split_huge_page_to_list(struct page
fail: if (mapping)
xa_unlock(&mapping->i_pages);
spin_unlock_irqrestore(zone_lru_lock(page_zone(head)), flags);
- unfreeze_page(head);
+ remap_page(head);
ret = -EBUSY;
}
--- a/mm/rmap.c~mm-huge_memory-rename-freeze_page-to-unmap_page
+++ a/mm/rmap.c
@@ -1627,16 +1627,9 @@ static bool try_to_unmap_one(struct page
address + PAGE_SIZE);
} else {
/*
- * We should not need to notify here as we reach this
- * case only from freeze_page() itself only call from
- * split_huge_page_to_list() so everything below must
- * be true:
- * - page is not anonymous
- * - page is locked
- *
- * So as it is a locked file back page thus it can not
- * be remove from the page cache and replace by a new
- * page before mmu_notifier_invalidate_range_end so no
+ * This is a locked file-backed page, thus it cannot
+ * be removed from the page cache and replaced by a new
+ * page before mmu_notifier_invalidate_range_end, so no
* concurrent thread might update its page table to
* point at new page while a device still is using this
* page.
_
Patches currently in -mm which might be from hughd(a)google.com are
mm-huge_memory-rename-freeze_page-to-unmap_page.patch
mm-huge_memory-splitting-set-mappingindex-before-unfreeze.patch
mm-huge_memory-fix-lockdep-complaint-on-32-bit-i_size_read.patch
mm-khugepaged-collapse_shmem-stop-if-punched-or-truncated.patch
mm-khugepaged-fix-crashes-due-to-misaccounted-holes.patch
mm-khugepaged-collapse_shmem-remember-to-clear-holes.patch
mm-khugepaged-minor-reorderings-in-collapse_shmem.patch
mm-khugepaged-collapse_shmem-without-freezing-new_page.patch
mm-khugepaged-collapse_shmem-do-not-crash-on-compound.patch
mm-khugepaged-fix-the-xas_create_range-error-path.patch
mm-put_and_wait_on_page_locked-while-page-is-migrated.patch
From: Dexuan Cui <decui(a)microsoft.com>
We can concurrently try to open the same sub-channel from 2 paths:
path #1: vmbus_onoffer() -> vmbus_process_offer() -> handle_sc_creation().
path #2: storvsc_probe() -> storvsc_connect_to_vsp() ->
-> storvsc_channel_init() -> handle_multichannel_storage() ->
-> vmbus_are_subchannels_present() -> handle_sc_creation().
They conflict with each other, but it was not an issue before the recent
commit ae6935ed7d42 ("vmbus: split ring buffer allocation from open"),
because at the beginning of vmbus_open() we checked newchannel->state so
only one path could succeed, and the other would return with -EINVAL.
After ae6935ed7d42, the failing path frees the channel's ringbuffer by
vmbus_free_ring(), and this causes a panic later.
Commit ae6935ed7d42 itself is good, and it just reveals the longstanding
race. We can resolve the issue by removing path #2, i.e. removing the
second vmbus_are_subchannels_present() in handle_multichannel_storage().
BTW, the comment "Check to see if sub-channels have already been created"
in handle_multichannel_storage() is incorrect: when we unload the driver,
we first close the sub-channel(s) and then close the primary channel, next
the host sends rescind-offer message(s) so primary->sc_list will become
empty. This means the first vmbus_are_subchannels_present() in
handle_multichannel_storage() is never useful.
Fixes: ae6935ed7d42 ("vmbus: split ring buffer allocation from open")
Cc: stable(a)vger.kernel.org
Cc: Long Li <longli(a)microsoft.com>
Cc: Stephen Hemminger <sthemmin(a)microsoft.com>
Cc: K. Y. Srinivasan <kys(a)microsoft.com>
Cc: Haiyang Zhang <haiyangz(a)microsoft.com>
Signed-off-by: Dexuan Cui <decui(a)microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys(a)microsoft.com>
---
drivers/scsi/storvsc_drv.c | 61 +++++++++++++++++++-------------------
1 file changed, 30 insertions(+), 31 deletions(-)
diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
index f03dc03a42c3..8f88348ebe42 100644
--- a/drivers/scsi/storvsc_drv.c
+++ b/drivers/scsi/storvsc_drv.c
@@ -446,7 +446,6 @@ struct storvsc_device {
bool destroy;
bool drain_notify;
- bool open_sub_channel;
atomic_t num_outstanding_req;
struct Scsi_Host *host;
@@ -636,33 +635,38 @@ static inline struct storvsc_device *get_in_stor_device(
static void handle_sc_creation(struct vmbus_channel *new_sc)
{
struct hv_device *device = new_sc->primary_channel->device_obj;
+ struct device *dev = &device->device;
struct storvsc_device *stor_device;
struct vmstorage_channel_properties props;
+ int ret;
stor_device = get_out_stor_device(device);
if (!stor_device)
return;
- if (stor_device->open_sub_channel == false)
- return;
-
memset(&props, 0, sizeof(struct vmstorage_channel_properties));
- vmbus_open(new_sc,
- storvsc_ringbuffer_size,
- storvsc_ringbuffer_size,
- (void *)&props,
- sizeof(struct vmstorage_channel_properties),
- storvsc_on_channel_callback, new_sc);
+ ret = vmbus_open(new_sc,
+ storvsc_ringbuffer_size,
+ storvsc_ringbuffer_size,
+ (void *)&props,
+ sizeof(struct vmstorage_channel_properties),
+ storvsc_on_channel_callback, new_sc);
- if (new_sc->state == CHANNEL_OPENED_STATE) {
- stor_device->stor_chns[new_sc->target_cpu] = new_sc;
- cpumask_set_cpu(new_sc->target_cpu, &stor_device->alloced_cpus);
+ /* In case vmbus_open() fails, we don't use the sub-channel. */
+ if (ret != 0) {
+ dev_err(dev, "Failed to open sub-channel: err=%d\n", ret);
+ return;
}
+
+ /* Add the sub-channel to the array of available channels. */
+ stor_device->stor_chns[new_sc->target_cpu] = new_sc;
+ cpumask_set_cpu(new_sc->target_cpu, &stor_device->alloced_cpus);
}
static void handle_multichannel_storage(struct hv_device *device, int max_chns)
{
+ struct device *dev = &device->device;
struct storvsc_device *stor_device;
int num_cpus = num_online_cpus();
int num_sc;
@@ -679,21 +683,11 @@ static void handle_multichannel_storage(struct hv_device *device, int max_chns)
request = &stor_device->init_request;
vstor_packet = &request->vstor_packet;
- stor_device->open_sub_channel = true;
/*
* Establish a handler for dealing with subchannels.
*/
vmbus_set_sc_create_callback(device->channel, handle_sc_creation);
- /*
- * Check to see if sub-channels have already been created. This
- * can happen when this driver is re-loaded after unloading.
- */
-
- if (vmbus_are_subchannels_present(device->channel))
- return;
-
- stor_device->open_sub_channel = false;
/*
* Request the host to create sub-channels.
*/
@@ -710,23 +704,29 @@ static void handle_multichannel_storage(struct hv_device *device, int max_chns)
VM_PKT_DATA_INBAND,
VMBUS_DATA_PACKET_FLAG_COMPLETION_REQUESTED);
- if (ret != 0)
+ if (ret != 0) {
+ dev_err(dev, "Failed to create sub-channel: err=%d\n", ret);
return;
+ }
t = wait_for_completion_timeout(&request->wait_event, 10*HZ);
- if (t == 0)
+ if (t == 0) {
+ dev_err(dev, "Failed to create sub-channel: timed out\n");
return;
+ }
if (vstor_packet->operation != VSTOR_OPERATION_COMPLETE_IO ||
- vstor_packet->status != 0)
+ vstor_packet->status != 0) {
+ dev_err(dev, "Failed to create sub-channel: op=%d, sts=%d\n",
+ vstor_packet->operation, vstor_packet->status);
return;
+ }
/*
- * Now that we created the sub-channels, invoke the check; this
- * may trigger the callback.
+ * We need to do nothing here, because vmbus_process_offer()
+ * invokes channel->sc_creation_callback, which will open and use
+ * the sub-channel(s).
*/
- stor_device->open_sub_channel = true;
- vmbus_are_subchannels_present(device->channel);
}
static void cache_wwn(struct storvsc_device *stor_device,
@@ -1794,7 +1794,6 @@ static int storvsc_probe(struct hv_device *device,
}
stor_device->destroy = false;
- stor_device->open_sub_channel = false;
init_waitqueue_head(&stor_device->waiting_to_drain);
stor_device->device = device;
stor_device->host = host;
--
2.19.1
The patch titled
Subject: zram: writeback throttle
has been added to the -mm tree. Its filename is
zram-writeback-throttle.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/zram-writeback-throttle.patch
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/zram-writeback-throttle.patch
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Minchan Kim <minchan(a)kernel.org>
Subject: zram: writeback throttle
On small memory systems there are lots of write IOs so if we use a flash
device as swap there would be serious flash wearout. To overcome this
problem, system developers need to design a write limitation strategy to
guarantee flash health for the entire product life.
This patch creates a new knob "writeback_limit" on zram. With that, if
the current writeback IO count (/sys/block/zramX/io_stat) exceeds the
limitation, zram stops further writeback until the admin can reset the
limit.
Link: http://lkml.kernel.org/r/20181127055429.251614-8-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan(a)kernel.org>
Cc: Joey Pabalinas <joeypabalinas(a)gmail.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work(a)gmail.com>
Cc: <stable(a)vger.kernel.org> [4.14+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
Documentation/ABI/testing/sysfs-block-zram | 9 +++
Documentation/blockdev/zram.txt | 2
drivers/block/zram/zram_drv.c | 47 ++++++++++++++++++-
drivers/block/zram/zram_drv.h | 2
4 files changed, 59 insertions(+), 1 deletion(-)
--- a/Documentation/ABI/testing/sysfs-block-zram~zram-writeback-throttle
+++ a/Documentation/ABI/testing/sysfs-block-zram
@@ -121,3 +121,12 @@ Description:
The bd_stat file is read-only and represents backing device's
statistics (bd_count, bd_reads, bd_writes) in a format
similar to block layer statistics file format.
+
+What: /sys/block/zram<id>/writeback_limit
+Date: November 2018
+Contact: Minchan Kim <minchan(a)kernel.org>
+Description:
+ The writeback_limit file is read-write and specifies the maximum
+ amount of writeback ZRAM can do. The limit could be changed
+ in run time and "0" means disable the limit.
+ No limit is the initial state.
--- a/Documentation/blockdev/zram.txt~zram-writeback-throttle
+++ a/Documentation/blockdev/zram.txt
@@ -164,6 +164,8 @@ reset WO trigger device r
mem_used_max WO reset the `mem_used_max' counter (see later)
mem_limit WO specifies the maximum amount of memory ZRAM can use
to store the compressed data
+writeback_limit WO specifies the maximum amount of write IO zram can
+ write out to backing device as 4KB unit
max_comp_streams RW the number of possible concurrent compress operations
comp_algorithm RW show and change the compression algorithm
compact WO trigger memory compaction
--- a/drivers/block/zram/zram_drv.c~zram-writeback-throttle
+++ a/drivers/block/zram/zram_drv.c
@@ -330,6 +330,40 @@ next:
}
#ifdef CONFIG_ZRAM_WRITEBACK
+
+static ssize_t writeback_limit_store(struct device *dev,
+ struct device_attribute *attr, const char *buf, size_t len)
+{
+ struct zram *zram = dev_to_zram(dev);
+ u64 val;
+ ssize_t ret = -EINVAL;
+
+ if (kstrtoull(buf, 10, &val))
+ return ret;
+
+ down_read(&zram->init_lock);
+ atomic64_set(&zram->stats.bd_wb_limit, val);
+ if (val == 0 || val > atomic64_read(&zram->stats.bd_writes))
+ zram->stop_writeback = false;
+ up_read(&zram->init_lock);
+ ret = len;
+
+ return ret;
+}
+
+static ssize_t writeback_limit_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ u64 val;
+ struct zram *zram = dev_to_zram(dev);
+
+ down_read(&zram->init_lock);
+ val = atomic64_read(&zram->stats.bd_wb_limit);
+ up_read(&zram->init_lock);
+
+ return scnprintf(buf, PAGE_SIZE, "%llu\n", val);
+}
+
static void reset_bdev(struct zram *zram)
{
struct block_device *bdev;
@@ -571,6 +605,7 @@ static ssize_t writeback_store(struct de
char mode_buf[8];
unsigned long mode = -1UL;
unsigned long blk_idx = 0;
+ u64 wb_count, wb_limit;
sz = strscpy(mode_buf, buf, sizeof(mode_buf));
if (sz <= 0)
@@ -612,6 +647,11 @@ static ssize_t writeback_store(struct de
bvec.bv_len = PAGE_SIZE;
bvec.bv_offset = 0;
+ if (zram->stop_writeback) {
+ ret = -EIO;
+ break;
+ }
+
if (!blk_idx) {
blk_idx = alloc_block_bdev(zram);
if (!blk_idx) {
@@ -670,7 +710,7 @@ static ssize_t writeback_store(struct de
continue;
}
- atomic64_inc(&zram->stats.bd_writes);
+ wb_count = atomic64_inc_return(&zram->stats.bd_writes);
/*
* We released zram_slot_lock so need to check if the slot was
* changed. If there is freeing for the slot, we can catch it
@@ -694,6 +734,9 @@ static ssize_t writeback_store(struct de
zram_set_element(zram, index, blk_idx);
blk_idx = 0;
atomic64_inc(&zram->stats.pages_stored);
+ wb_limit = atomic64_read(&zram->stats.bd_wb_limit);
+ if (wb_limit != 0 && wb_count >= wb_limit)
+ zram->stop_writeback = true;
next:
zram_slot_unlock(zram, index);
}
@@ -1767,6 +1810,7 @@ static DEVICE_ATTR_RW(comp_algorithm);
#ifdef CONFIG_ZRAM_WRITEBACK
static DEVICE_ATTR_RW(backing_dev);
static DEVICE_ATTR_WO(writeback);
+static DEVICE_ATTR_RW(writeback_limit);
#endif
static struct attribute *zram_disk_attrs[] = {
@@ -1782,6 +1826,7 @@ static struct attribute *zram_disk_attrs
#ifdef CONFIG_ZRAM_WRITEBACK
&dev_attr_backing_dev.attr,
&dev_attr_writeback.attr,
+ &dev_attr_writeback_limit.attr,
#endif
&dev_attr_io_stat.attr,
&dev_attr_mm_stat.attr,
--- a/drivers/block/zram/zram_drv.h~zram-writeback-throttle
+++ a/drivers/block/zram/zram_drv.h
@@ -86,6 +86,7 @@ struct zram_stats {
atomic64_t bd_count; /* no. of pages in backing device */
atomic64_t bd_reads; /* no. of reads from backing device */
atomic64_t bd_writes; /* no. of writes from backing device */
+ atomic64_t bd_wb_limit; /* writeback limit of backing device */
#endif
};
@@ -113,6 +114,7 @@ struct zram {
*/
bool claim; /* Protected by bdev->bd_mutex */
struct file *backing_dev;
+ bool stop_writeback;
#ifdef CONFIG_ZRAM_WRITEBACK
struct block_device *bdev;
unsigned int old_block_size;
_
Patches currently in -mm which might be from minchan(a)kernel.org are
zram-fix-lockdep-warning-of-free-block-handling.patch
zram-fix-double-free-backing-device.patch
zram-refactoring-flags-and-writeback-stuff.patch
zram-introduce-zram_idle-flag.patch
zram-support-idle-huge-page-writeback.patch
zram-add-bd_stat-statistics.patch
zram-writeback-throttle.patch
The patch titled
Subject: zram: add bd_stat statistics
has been added to the -mm tree. Its filename is
zram-add-bd_stat-statistics.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/zram-add-bd_stat-statistics.patch
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/zram-add-bd_stat-statistics.patch
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Minchan Kim <minchan(a)kernel.org>
Subject: zram: add bd_stat statistics
bd_stat represents things that happened in the backing device. Currently
it supports bd_counts, bd_reads and bd_writes which are helpful to
understand wearout of flash and memory saving.
Link: http://lkml.kernel.org/r/20181127055429.251614-7-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan(a)kernel.org>
Cc: Joey Pabalinas <joeypabalinas(a)gmail.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work(a)gmail.com>
Cc: <stable(a)vger.kernel.org> [4.14+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
Documentation/ABI/testing/sysfs-block-zram | 8 +++++
Documentation/blockdev/zram.txt | 11 +++++++
drivers/block/zram/zram_drv.c | 29 +++++++++++++++++++
drivers/block/zram/zram_drv.h | 5 +++
4 files changed, 53 insertions(+)
--- a/Documentation/ABI/testing/sysfs-block-zram~zram-add-bd_stat-statistics
+++ a/Documentation/ABI/testing/sysfs-block-zram
@@ -113,3 +113,11 @@ Contact: Minchan Kim <minchan(a)kernel.org
Description:
The writeback file is write-only and trigger idle and/or
huge page writeback to backing device.
+
+What: /sys/block/zram<id>/bd_stat
+Date: November 2018
+Contact: Minchan Kim <minchan(a)kernel.org>
+Description:
+ The bd_stat file is read-only and represents backing device's
+ statistics (bd_count, bd_reads, bd_writes) in a format
+ similar to block layer statistics file format.
--- a/Documentation/blockdev/zram.txt~zram-add-bd_stat-statistics
+++ a/Documentation/blockdev/zram.txt
@@ -221,6 +221,17 @@ line of text and contains the following
pages_compacted the number of pages freed during compaction
huge_pages the number of incompressible pages
+File /sys/block/zram<id>/bd_stat
+
+The stat file represents device's backing device statistics. It consists of
+a single line of text and contains the following stats separated by whitespace:
+ bd_count size of data written in backing device.
+ Unit: 4K bytes
+ bd_reads the number of reads from backing device
+ Unit: 4K bytes
+ bd_writes the number of writes to backing device
+ Unit: 4K bytes
+
9) Deactivate:
swapoff /dev/zram0
umount /dev/zram1
--- a/drivers/block/zram/zram_drv.c~zram-add-bd_stat-statistics
+++ a/drivers/block/zram/zram_drv.c
@@ -502,6 +502,7 @@ retry:
if (test_and_set_bit(blk_idx, zram->bitmap))
goto retry;
+ atomic64_inc(&zram->stats.bd_count);
return blk_idx;
}
@@ -511,6 +512,7 @@ static void free_block_bdev(struct zram
was_set = test_and_clear_bit(blk_idx, zram->bitmap);
WARN_ON_ONCE(!was_set);
+ atomic64_dec(&zram->stats.bd_count);
}
static void zram_page_end_io(struct bio *bio)
@@ -668,6 +670,7 @@ static ssize_t writeback_store(struct de
continue;
}
+ atomic64_inc(&zram->stats.bd_writes);
/*
* We released zram_slot_lock so need to check if the slot was
* changed. If there is freeing for the slot, we can catch it
@@ -757,6 +760,7 @@ static int read_from_bdev_sync(struct zr
static int read_from_bdev(struct zram *zram, struct bio_vec *bvec,
unsigned long entry, struct bio *parent, bool sync)
{
+ atomic64_inc(&zram->stats.bd_reads);
if (sync)
return read_from_bdev_sync(zram, bvec, entry, parent);
else
@@ -1013,6 +1017,25 @@ static ssize_t mm_stat_show(struct devic
return ret;
}
+#ifdef CONFIG_ZRAM_WRITEBACK
+static ssize_t bd_stat_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct zram *zram = dev_to_zram(dev);
+ ssize_t ret;
+
+ down_read(&zram->init_lock);
+ ret = scnprintf(buf, PAGE_SIZE,
+ "%8llu %8llu %8llu\n",
+ (u64)atomic64_read(&zram->stats.bd_count),
+ (u64)atomic64_read(&zram->stats.bd_reads),
+ (u64)atomic64_read(&zram->stats.bd_writes));
+ up_read(&zram->init_lock);
+
+ return ret;
+}
+#endif
+
static ssize_t debug_stat_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
@@ -1033,6 +1056,9 @@ static ssize_t debug_stat_show(struct de
static DEVICE_ATTR_RO(io_stat);
static DEVICE_ATTR_RO(mm_stat);
+#ifdef CONFIG_ZRAM_WRITEBACK
+static DEVICE_ATTR_RO(bd_stat);
+#endif
static DEVICE_ATTR_RO(debug_stat);
static void zram_meta_free(struct zram *zram, u64 disksize)
@@ -1759,6 +1785,9 @@ static struct attribute *zram_disk_attrs
#endif
&dev_attr_io_stat.attr,
&dev_attr_mm_stat.attr,
+#ifdef CONFIG_ZRAM_WRITEBACK
+ &dev_attr_bd_stat.attr,
+#endif
&dev_attr_debug_stat.attr,
NULL,
};
--- a/drivers/block/zram/zram_drv.h~zram-add-bd_stat-statistics
+++ a/drivers/block/zram/zram_drv.h
@@ -82,6 +82,11 @@ struct zram_stats {
atomic_long_t max_used_pages; /* no. of maximum pages stored */
atomic64_t writestall; /* no. of write slow paths */
atomic64_t miss_free; /* no. of missed free */
+#ifdef CONFIG_ZRAM_WRITEBACK
+ atomic64_t bd_count; /* no. of pages in backing device */
+ atomic64_t bd_reads; /* no. of reads from backing device */
+ atomic64_t bd_writes; /* no. of writes from backing device */
+#endif
};
struct zram {
_
Patches currently in -mm which might be from minchan(a)kernel.org are
zram-fix-lockdep-warning-of-free-block-handling.patch
zram-fix-double-free-backing-device.patch
zram-refactoring-flags-and-writeback-stuff.patch
zram-introduce-zram_idle-flag.patch
zram-support-idle-huge-page-writeback.patch
zram-add-bd_stat-statistics.patch
zram-writeback-throttle.patch
The patch titled
Subject: zram: support idle/huge page writeback
has been added to the -mm tree. Its filename is
zram-support-idle-huge-page-writeback.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/zram-support-idle-huge-page-writeb…
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/zram-support-idle-huge-page-writeb…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Minchan Kim <minchan(a)kernel.org>
Subject: zram: support idle/huge page writeback
Add a new feature "zram idle/huge page writeback". In the zram-swap use
case, zram usually has many idle/huge swap pages. It's pointless to keep
them in memory (ie, zram).
To solve this problem, this feature introduces idle/huge page writeback to
the backing device so the goal is to save more memory space on embedded
systems.
Normal sequence to use idle/huge page writeback feature is as follows,
while (1) {
# mark allocated zram slot to idle
echo all > /sys/block/zram0/idle
# leave system working for several hours
# Unless there is no access for some blocks on zram,
# they are still IDLE marked pages.
echo "idle" > /sys/block/zram0/writeback
or/and
echo "huge" > /sys/block/zram0/writeback
# write the IDLE or/and huge marked slot into backing device
# and free the memory.
}
By per discussion:
https://lore.kernel.org/lkml/20181122065926.GG3441@jagdpanzerIV/T/#u,
This patch removes direct incommpressibe page writeback feature
(d2afd25114f4 ("zram: write incompressible pages to backing device")) so
we could regard it as regression because incompressible pages don't go to
backing storage automatically. Instead, users should do this via "echo
huge" > /sys/block/zram/writeback" manually.
If we hear some regression, we could restore the function.
Link: http://lkml.kernel.org/r/20181127055429.251614-6-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan(a)kernel.org>
Reviewed-by: Joey Pabalinas <joeypabalinas(a)gmail.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work(a)gmail.com>
Cc: <stable(a)vger.kernel.org> [4.14+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
Documentation/ABI/testing/sysfs-block-zram | 7
Documentation/blockdev/zram.txt | 28 +-
drivers/block/zram/Kconfig | 5
drivers/block/zram/zram_drv.c | 247 +++++++++++++------
drivers/block/zram/zram_drv.h | 1
5 files changed, 209 insertions(+), 79 deletions(-)
--- a/Documentation/ABI/testing/sysfs-block-zram~zram-support-idle-huge-page-writeback
+++ a/Documentation/ABI/testing/sysfs-block-zram
@@ -106,3 +106,10 @@ Description:
idle file is write-only and mark zram slot as idle.
If system has mounted debugfs, user can see which slots
are idle via /sys/kernel/debug/zram/zram<id>/block_state
+
+What: /sys/block/zram<id>/writeback
+Date: November 2018
+Contact: Minchan Kim <minchan(a)kernel.org>
+Description:
+ The writeback file is write-only and trigger idle and/or
+ huge page writeback to backing device.
--- a/Documentation/blockdev/zram.txt~zram-support-idle-huge-page-writeback
+++ a/Documentation/blockdev/zram.txt
@@ -238,11 +238,31 @@ line of text and contains the following
= writeback
-With incompressible pages, there is no memory saving with zram.
-Instead, with CONFIG_ZRAM_WRITEBACK, zram can write incompressible page
+With CONFIG_ZRAM_WRITEBACK, zram can write idle/incompressible page
to backing storage rather than keeping it in memory.
-User should set up backing device via /sys/block/zramX/backing_dev
-before disksize setting.
+To use the feature, admin should set up backing device via
+
+ "echo /dev/sda5 > /sys/block/zramX/backing_dev"
+
+before disksize setting. It supports only partition at this moment.
+If admin want to use incompressible page writeback, they could do via
+
+ "echo huge > /sys/block/zramX/write"
+
+To use idle page writeback, first, user need to declare zram pages
+as idle.
+
+ "echo all > /sys/block/zramX/idle"
+
+From now on, any pages on zram are idle pages. The idle mark
+will be removed until someone request access of the block.
+IOW, unless there is access request, those pages are still idle pages.
+
+Admin can request writeback of those idle pages at right timing via
+
+ "echo idle > /sys/block/zramX/writeback"
+
+With the command, zram writeback idle pages from memory to the storage.
= memory tracking
--- a/drivers/block/zram/Kconfig~zram-support-idle-huge-page-writeback
+++ a/drivers/block/zram/Kconfig
@@ -15,7 +15,7 @@ config ZRAM
See Documentation/blockdev/zram.txt for more information.
config ZRAM_WRITEBACK
- bool "Write back incompressible page to backing device"
+ bool "Write back incompressible or idle page to backing device"
depends on ZRAM
help
With incompressible page, there is no memory saving to keep it
@@ -23,6 +23,9 @@ config ZRAM_WRITEBACK
For this feature, admin should set up backing device via
/sys/block/zramX/backing_dev.
+ With /sys/block/zramX/{idle,writeback}, application could ask
+ idle page's writeback to the backing device to save in memory.
+
See Documentation/blockdev/zram.txt for more information.
config ZRAM_MEMORY_TRACKING
--- a/drivers/block/zram/zram_drv.c~zram-support-idle-huge-page-writeback
+++ a/drivers/block/zram/zram_drv.c
@@ -52,6 +52,9 @@ static unsigned int num_devices = 1;
static size_t huge_class_size;
static void zram_free_page(struct zram *zram, size_t index);
+static int zram_bvec_read(struct zram *zram, struct bio_vec *bvec,
+ u32 index, int offset, struct bio *bio);
+
static int zram_slot_trylock(struct zram *zram, u32 index)
{
@@ -73,13 +76,6 @@ static inline bool init_done(struct zram
return zram->disksize;
}
-static inline bool zram_allocated(struct zram *zram, u32 index)
-{
-
- return (zram->table[index].flags >> (ZRAM_FLAG_SHIFT + 1)) ||
- zram->table[index].handle;
-}
-
static inline struct zram *dev_to_zram(struct device *dev)
{
return (struct zram *)dev_to_disk(dev)->private_data;
@@ -138,6 +134,13 @@ static void zram_set_obj_size(struct zra
zram->table[index].flags = (flags << ZRAM_FLAG_SHIFT) | size;
}
+static inline bool zram_allocated(struct zram *zram, u32 index)
+{
+ return zram_get_obj_size(zram, index) ||
+ zram_test_flag(zram, index, ZRAM_SAME) ||
+ zram_test_flag(zram, index, ZRAM_WB);
+}
+
#if PAGE_SIZE != 4096
static inline bool is_partial_io(struct bio_vec *bvec)
{
@@ -308,10 +311,14 @@ static ssize_t idle_store(struct device
}
for (index = 0; index < nr_pages; index++) {
+ /*
+ * Do not mark ZRAM_UNDER_WB slot as ZRAM_IDLE to close race.
+ * See the comment in writeback_store.
+ */
zram_slot_lock(zram, index);
- if (!zram_allocated(zram, index))
+ if (!zram_allocated(zram, index) ||
+ zram_test_flag(zram, index, ZRAM_UNDER_WB))
goto next;
-
zram_set_flag(zram, index, ZRAM_IDLE);
next:
zram_slot_unlock(zram, index);
@@ -546,6 +553,158 @@ static int read_from_bdev_async(struct z
return 1;
}
+#define HUGE_WRITEBACK 0x1
+#define IDLE_WRITEBACK 0x2
+
+static ssize_t writeback_store(struct device *dev,
+ struct device_attribute *attr, const char *buf, size_t len)
+{
+ struct zram *zram = dev_to_zram(dev);
+ unsigned long nr_pages = zram->disksize >> PAGE_SHIFT;
+ unsigned long index;
+ struct bio bio;
+ struct bio_vec bio_vec;
+ struct page *page;
+ ssize_t ret, sz;
+ char mode_buf[8];
+ unsigned long mode = -1UL;
+ unsigned long blk_idx = 0;
+
+ sz = strscpy(mode_buf, buf, sizeof(mode_buf));
+ if (sz <= 0)
+ return -EINVAL;
+
+ /* ignore trailing newline */
+ if (mode_buf[sz - 1] == '\n')
+ mode_buf[sz - 1] = 0x00;
+
+ if (!strcmp(mode_buf, "idle"))
+ mode = IDLE_WRITEBACK;
+ else if (!strcmp(mode_buf, "huge"))
+ mode = HUGE_WRITEBACK;
+
+ if (mode == -1UL)
+ return -EINVAL;
+
+ down_read(&zram->init_lock);
+ if (!init_done(zram)) {
+ ret = -EINVAL;
+ goto release_init_lock;
+ }
+
+ if (!zram->backing_dev) {
+ ret = -ENODEV;
+ goto release_init_lock;
+ }
+
+ page = alloc_page(GFP_KERNEL);
+ if (!page) {
+ ret = -ENOMEM;
+ goto release_init_lock;
+ }
+
+ for (index = 0; index < nr_pages; index++) {
+ struct bio_vec bvec;
+
+ bvec.bv_page = page;
+ bvec.bv_len = PAGE_SIZE;
+ bvec.bv_offset = 0;
+
+ if (!blk_idx) {
+ blk_idx = alloc_block_bdev(zram);
+ if (!blk_idx) {
+ ret = -ENOSPC;
+ break;
+ }
+ }
+
+ zram_slot_lock(zram, index);
+ if (!zram_allocated(zram, index))
+ goto next;
+
+ if (zram_test_flag(zram, index, ZRAM_WB) ||
+ zram_test_flag(zram, index, ZRAM_SAME) ||
+ zram_test_flag(zram, index, ZRAM_UNDER_WB))
+ goto next;
+
+ if ((mode & IDLE_WRITEBACK &&
+ !zram_test_flag(zram, index, ZRAM_IDLE)) &&
+ (mode & HUGE_WRITEBACK &&
+ !zram_test_flag(zram, index, ZRAM_HUGE)))
+ goto next;
+ /*
+ * Clearing ZRAM_UNDER_WB is duty of caller.
+ * IOW, zram_free_page never clear it.
+ */
+ zram_set_flag(zram, index, ZRAM_UNDER_WB);
+ /* Need for hugepage writeback racing */
+ zram_set_flag(zram, index, ZRAM_IDLE);
+ zram_slot_unlock(zram, index);
+ if (zram_bvec_read(zram, &bvec, index, 0, NULL)) {
+ zram_slot_lock(zram, index);
+ zram_clear_flag(zram, index, ZRAM_UNDER_WB);
+ zram_clear_flag(zram, index, ZRAM_IDLE);
+ zram_slot_unlock(zram, index);
+ continue;
+ }
+
+ bio_init(&bio, &bio_vec, 1);
+ bio_set_dev(&bio, zram->bdev);
+ bio.bi_iter.bi_sector = blk_idx * (PAGE_SIZE >> 9);
+ bio.bi_opf = REQ_OP_WRITE | REQ_SYNC;
+
+ bio_add_page(&bio, bvec.bv_page, bvec.bv_len,
+ bvec.bv_offset);
+ /*
+ * XXX: A single page IO would be inefficient for write
+ * but it would be not bad as starter.
+ */
+ ret = submit_bio_wait(&bio);
+ if (ret) {
+ zram_slot_lock(zram, index);
+ zram_clear_flag(zram, index, ZRAM_UNDER_WB);
+ zram_clear_flag(zram, index, ZRAM_IDLE);
+ zram_slot_unlock(zram, index);
+ continue;
+ }
+
+ /*
+ * We released zram_slot_lock so need to check if the slot was
+ * changed. If there is freeing for the slot, we can catch it
+ * easily by zram_allocated.
+ * A subtle case is the slot is freed/reallocated/marked as
+ * ZRAM_IDLE again. To close the race, idle_store doesn't
+ * mark ZRAM_IDLE once it found the slot was ZRAM_UNDER_WB.
+ * Thus, we could close the race by checking ZRAM_IDLE bit.
+ */
+ zram_slot_lock(zram, index);
+ if (!zram_allocated(zram, index) ||
+ !zram_test_flag(zram, index, ZRAM_IDLE)) {
+ zram_clear_flag(zram, index, ZRAM_UNDER_WB);
+ zram_clear_flag(zram, index, ZRAM_IDLE);
+ goto next;
+ }
+
+ zram_free_page(zram, index);
+ zram_clear_flag(zram, index, ZRAM_UNDER_WB);
+ zram_set_flag(zram, index, ZRAM_WB);
+ zram_set_element(zram, index, blk_idx);
+ blk_idx = 0;
+ atomic64_inc(&zram->stats.pages_stored);
+next:
+ zram_slot_unlock(zram, index);
+ }
+
+ if (blk_idx)
+ free_block_bdev(zram, blk_idx);
+ ret = len;
+ __free_page(page);
+release_init_lock:
+ up_read(&zram->init_lock);
+
+ return ret;
+}
+
struct zram_work {
struct work_struct work;
struct zram *zram;
@@ -603,57 +762,8 @@ static int read_from_bdev(struct zram *z
else
return read_from_bdev_async(zram, bvec, entry, parent);
}
-
-static int write_to_bdev(struct zram *zram, struct bio_vec *bvec,
- u32 index, struct bio *parent,
- unsigned long *pentry)
-{
- struct bio *bio;
- unsigned long entry;
-
- bio = bio_alloc(GFP_ATOMIC, 1);
- if (!bio)
- return -ENOMEM;
-
- entry = alloc_block_bdev(zram);
- if (!entry) {
- bio_put(bio);
- return -ENOSPC;
- }
-
- bio->bi_iter.bi_sector = entry * (PAGE_SIZE >> 9);
- bio_set_dev(bio, zram->bdev);
- if (!bio_add_page(bio, bvec->bv_page, bvec->bv_len,
- bvec->bv_offset)) {
- bio_put(bio);
- free_block_bdev(zram, entry);
- return -EIO;
- }
-
- if (!parent) {
- bio->bi_opf = REQ_OP_WRITE | REQ_SYNC;
- bio->bi_end_io = zram_page_end_io;
- } else {
- bio->bi_opf = parent->bi_opf;
- bio_chain(bio, parent);
- }
-
- submit_bio(bio);
- *pentry = entry;
-
- return 0;
-}
-
#else
static inline void reset_bdev(struct zram *zram) {};
-static int write_to_bdev(struct zram *zram, struct bio_vec *bvec,
- u32 index, struct bio *parent,
- unsigned long *pentry)
-
-{
- return -EIO;
-}
-
static int read_from_bdev(struct zram *zram, struct bio_vec *bvec,
unsigned long entry, struct bio *parent, bool sync)
{
@@ -1006,7 +1116,8 @@ out:
atomic64_dec(&zram->stats.pages_stored);
zram_set_handle(zram, index, 0);
zram_set_obj_size(zram, index, 0);
- WARN_ON_ONCE(zram->table[index].flags & ~(1UL << ZRAM_LOCK));
+ WARN_ON_ONCE(zram->table[index].flags &
+ ~(1UL << ZRAM_LOCK | 1UL << ZRAM_UNDER_WB));
}
static int __zram_bvec_read(struct zram *zram, struct page *page, u32 index,
@@ -1115,7 +1226,6 @@ static int __zram_bvec_write(struct zram
struct page *page = bvec->bv_page;
unsigned long element = 0;
enum zram_pageflags flags = 0;
- bool allow_wb = true;
mem = kmap_atomic(page);
if (page_same_filled(mem, &element)) {
@@ -1140,21 +1250,8 @@ compress_again:
return ret;
}
- if (unlikely(comp_len >= huge_class_size)) {
+ if (comp_len >= huge_class_size)
comp_len = PAGE_SIZE;
- if (zram->backing_dev && allow_wb) {
- zcomp_stream_put(zram->comp);
- ret = write_to_bdev(zram, bvec, index, bio, &element);
- if (!ret) {
- flags = ZRAM_WB;
- ret = 1;
- goto out;
- }
- allow_wb = false;
- goto compress_again;
- }
- }
-
/*
* handle allocation has 2 paths:
* a) fast path is executed with preemption disabled (for
@@ -1643,6 +1740,7 @@ static DEVICE_ATTR_RW(max_comp_streams);
static DEVICE_ATTR_RW(comp_algorithm);
#ifdef CONFIG_ZRAM_WRITEBACK
static DEVICE_ATTR_RW(backing_dev);
+static DEVICE_ATTR_WO(writeback);
#endif
static struct attribute *zram_disk_attrs[] = {
@@ -1657,6 +1755,7 @@ static struct attribute *zram_disk_attrs
&dev_attr_comp_algorithm.attr,
#ifdef CONFIG_ZRAM_WRITEBACK
&dev_attr_backing_dev.attr,
+ &dev_attr_writeback.attr,
#endif
&dev_attr_io_stat.attr,
&dev_attr_mm_stat.attr,
--- a/drivers/block/zram/zram_drv.h~zram-support-idle-huge-page-writeback
+++ a/drivers/block/zram/zram_drv.h
@@ -47,6 +47,7 @@ enum zram_pageflags {
ZRAM_LOCK = ZRAM_FLAG_SHIFT,
ZRAM_SAME, /* Page consists the same element */
ZRAM_WB, /* page is stored on backing_device */
+ ZRAM_UNDER_WB, /* page is under writeback */
ZRAM_HUGE, /* Incompressible page */
ZRAM_IDLE, /* not accessed page since last idle marking */
_
Patches currently in -mm which might be from minchan(a)kernel.org are
zram-fix-lockdep-warning-of-free-block-handling.patch
zram-fix-double-free-backing-device.patch
zram-refactoring-flags-and-writeback-stuff.patch
zram-introduce-zram_idle-flag.patch
zram-support-idle-huge-page-writeback.patch
zram-add-bd_stat-statistics.patch
zram-writeback-throttle.patch