Felix Abecassis reports that move_pages() returns random status values
if the pages are already on the target node, as shown by the test
program below:
---8<---
/* Build with: gcc move_pages_bug.c -o move_pages_bug -lnuma */
#include <numaif.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	const long node_id = 1;
	const long page_size = sysconf(_SC_PAGESIZE);
	const int64_t num_pages = 8;

	/* Bind allocations to node_id so every page starts on the target node */
	unsigned long nodemask = 1UL << node_id;
	long ret = set_mempolicy(MPOL_BIND, &nodemask, sizeof(nodemask));
	if (ret < 0)
		return (EXIT_FAILURE);

	void **pages = malloc(sizeof(void*) * num_pages);
	for (int i = 0; i < num_pages; ++i) {
		pages[i] = mmap(NULL, page_size, PROT_WRITE | PROT_READ,
				MAP_PRIVATE | MAP_POPULATE | MAP_ANONYMOUS,
				-1, 0);
		if (pages[i] == MAP_FAILED)
			return (EXIT_FAILURE);
	}

	ret = set_mempolicy(MPOL_DEFAULT, NULL, 0);
	if (ret < 0)
		return (EXIT_FAILURE);

	int *nodes = malloc(sizeof(int) * num_pages);
	int *status = malloc(sizeof(int) * num_pages);
	for (int i = 0; i < num_pages; ++i) {
		nodes[i] = node_id;
		status[i] = 0xd0; /* simulate garbage values */
	}

	/* Ask to move the pages to the node they are already on */
	ret = move_pages(0, num_pages, pages, nodes, status, MPOL_MF_MOVE);
	printf("move_pages: %ld\n", ret);
	for (int i = 0; i < num_pages; ++i)
		printf("status[%d] = %d\n", i, status[i]);
}
---8<---
Running the program then prints nonsense status values:
$ ./move_pages_bug
move_pages: 0
status[0] = 208
status[1] = 208
status[2] = 208
status[3] = 208
status[4] = 208
status[5] = 208
status[6] = 208
status[7] = 208
This is because the status is not set if the page is already on the
target node, but move_pages() should return a valid status whenever it
succeeds. A valid status is either a negative errno or a node id.
We can't simply initialize the status array to zero, since the pages may
not be on node 0. Fix it by updating the status with the id of the node
the page is already on. It looks like we have to update the status inside
add_page_for_migration(), since the page struct is not available outside
it.
Make add_page_for_migration() return 1 if store_status() fails, so that
the failure is not mixed up with a status value, since -EFAULT is also a
valid status.
Fixes: a49bd4d71637 ("mm, numa: rework do_pages_move")
Reported-by: Felix Abecassis <fabecassis(a)nvidia.com>
Tested-by: Felix Abecassis <fabecassis(a)nvidia.com>
Cc: John Hubbard <jhubbard(a)nvidia.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Christoph Lameter <cl(a)linux.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Mel Gorman <mgorman(a)techsingularity.net>
Cc: <stable(a)vger.kernel.org> 4.17+
Signed-off-by: Yang Shi <yang.shi(a)linux.alibaba.com>
---
v2: *Corrected the return value when add_page_for_migration() returns 1.
John noticed another return value inconsistency between the implementation and
the manpage. The manpage says it should return -ENOENT if the page is already
on the target node, but it doesn't. It looks like the original code didn't
return -ENOENT either, so I'm not sure whether this is a documentation issue.
Anyway, this is a separate issue; once we confirm it we can fix it later.
mm/migrate.c | 36 ++++++++++++++++++++++++++++++------
1 file changed, 30 insertions(+), 6 deletions(-)
diff --git a/mm/migrate.c b/mm/migrate.c
index a8f87cb..f1090a0 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1512,17 +1512,21 @@ static int do_move_pages_to_node(struct mm_struct *mm,
/*
* Resolves the given address to a struct page, isolates it from the LRU and
* puts it to the given pagelist.
- * Returns -errno if the page cannot be found/isolated or 0 when it has been
- * queued or the page doesn't need to be migrated because it is already on
- * the target node
+ * Returns:
+ * errno - if the page cannot be found/isolated
+ * 0 - when it has been queued or the page doesn't need to be migrated
+ * because it is already on the target node
+ * 1 - if store_status() is failed
*/
static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
- int node, struct list_head *pagelist, bool migrate_all)
+ int node, struct list_head *pagelist, bool migrate_all,
+ int __user *status, int start)
{
struct vm_area_struct *vma;
struct page *page;
unsigned int follflags;
int err;
+ bool same_node = false;
down_read(&mm->mmap_sem);
err = -EFAULT;
@@ -1543,8 +1547,10 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
goto out;
err = 0;
- if (page_to_nid(page) == node)
+ if (page_to_nid(page) == node) {
+ same_node = true;
goto out_putpage;
+ }
err = -EACCES;
if (page_mapcount(page) > 1 && !migrate_all)
@@ -1578,6 +1584,16 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
put_page(page);
out:
up_read(&mm->mmap_sem);
+
+ /*
+ * Must call store_status() after releasing mmap_sem since put_user
+ * need acquire mmap_sem too, otherwise potential deadlock may exist.
+ */
+ if (same_node) {
+ if (store_status(status, start, node, 1))
+ err = 1;
+ }
+
return err;
}
@@ -1639,10 +1655,18 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
* report them via status
*/
err = add_page_for_migration(mm, addr, current_node,
- &pagelist, flags & MPOL_MF_MOVE_ALL);
+ &pagelist, flags & MPOL_MF_MOVE_ALL, status,
+ i);
+
if (!err)
continue;
+ /* store_status() failed in add_page_for_migration() */
+ if (err > 0) {
+ err = -EFAULT;
+ goto out_flush;
+ }
+
err = store_status(status, i, err, 1);
if (err)
goto out_flush;
--
1.8.3.1
Since commit 0a432dcbeb32edcd211a5d8f7847d0da7642a8b4 ("mm: shrinker:
make shrinker not depend on memcg kmem"), the shrinkers' idr is guarded
by CONFIG_MEMCG instead of CONFIG_MEMCG_KMEM, so it makes no sense to
guard the shrinker idr_replace() call with CONFIG_MEMCG_KMEM.
In the CONFIG_MEMCG && CONFIG_SLOB case, shrinker_idr contains only one
shrinker, deferred_split_shrinker. But it is never actually called,
since idr_replace() is never compiled due to the wrong #ifdef.
deferred_split_shrinker stays in a half-registered state all the time,
and it is never called for subordinate mem cgroups.
Fixes: 0a432dcbeb32 ("mm: shrinker: make shrinker not depend on memcg kmem")
Reviewed-by: Kirill Tkhai <ktkhai(a)virtuozzo.com>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Shakeel Butt <shakeelb(a)google.com>
Cc: Roman Gushchin <guro(a)fb.com>
Cc: <stable(a)vger.kernel.org> 5.4+
Signed-off-by: Yang Shi <yang.shi(a)linux.alibaba.com>
---
mm/vmscan.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ee4eecc..e7f10c4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -422,7 +422,7 @@ void register_shrinker_prepared(struct shrinker *shrinker)
{
down_write(&shrinker_rwsem);
list_add_tail(&shrinker->list, &shrinker_list);
-#ifdef CONFIG_MEMCG_KMEM
+#ifdef CONFIG_MEMCG
if (shrinker->flags & SHRINKER_MEMCG_AWARE)
idr_replace(&shrinker_idr, shrinker, shrinker->id);
#endif
--
1.8.3.1
I'm announcing the release of the 4.4.206 kernel.
All users of the 4.4 kernel series must upgrade.
The updated 4.4.y git tree can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git linux-4.4.y
and can be browsed at the normal kernel.org git web browser:
https://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=summary
thanks,
greg k-h
------------
Documentation/hid/uhid.txt | 2
Makefile | 2
arch/arm/Kconfig.debug | 28 +++++------
arch/arm/boot/dts/imx53-voipac-dmm-668.dtsi | 8 ---
arch/arm/mach-ks8695/board-acs5k.c | 2
arch/arm64/kernel/smp.c | 1
arch/microblaze/Makefile | 12 ++--
arch/microblaze/boot/Makefile | 4 -
arch/openrisc/kernel/entry.S | 2
arch/openrisc/kernel/head.S | 2
arch/powerpc/boot/dts/bamboo.dts | 4 +
arch/powerpc/kernel/prom.c | 6 +-
arch/powerpc/mm/fault.c | 17 +++----
arch/powerpc/mm/ppc_mmu_32.c | 4 -
arch/powerpc/platforms/pseries/dlpar.c | 4 +
arch/powerpc/xmon/xmon.c | 2
arch/s390/kvm/kvm-s390.c | 17 +++++--
arch/um/Kconfig.debug | 1
crypto/crypto_user.c | 37 ++++++++-------
drivers/acpi/acpi_lpss.c | 7 --
drivers/acpi/apei/ghes.c | 30 ++++++------
drivers/block/drbd/drbd_main.c | 1
drivers/block/drbd/drbd_nl.c | 6 +-
drivers/block/drbd/drbd_receiver.c | 19 +++++++
drivers/block/drbd/drbd_state.h | 2
drivers/char/hw_random/stm32-rng.c | 8 +++
drivers/clk/samsung/clk-exynos5420.c | 6 ++
drivers/hid/hid-core.c | 51 ++++++++++++++++++---
drivers/infiniband/hw/qib/qib_sdma.c | 4 +
drivers/infiniband/ulp/srp/ib_srp.c | 1
drivers/input/serio/gscps2.c | 4 -
drivers/input/serio/hp_sdc.c | 4 -
drivers/media/v4l2-core/v4l2-ctrls.c | 1
drivers/misc/mei/bus.c | 9 ++-
drivers/mtd/mtdcore.h | 2
drivers/mtd/mtdpart.c | 35 ++++++++++++--
drivers/mtd/ubi/build.c | 2
drivers/mtd/ubi/kapi.c | 2
drivers/net/can/c_can/c_can.c | 26 ++++++++++
drivers/net/can/usb/peak_usb/pcan_usb.c | 15 ++++--
drivers/net/ethernet/atheros/atl1e/atl1e_main.c | 4 +
drivers/net/ethernet/cadence/macb.c | 12 ++--
drivers/net/ethernet/sfc/ef10.c | 29 ++++++++---
drivers/net/ethernet/stmicro/stmmac/dwmac-sunxi.c | 4 +
drivers/net/macvlan.c | 3 -
drivers/net/slip/slip.c | 1
drivers/net/wireless/ath/ath6kl/cfg80211.c | 4 -
drivers/net/wireless/mwifiex/debugfs.c | 14 ++---
drivers/net/wireless/mwifiex/scan.c | 18 ++++---
drivers/net/wireless/realtek/rtl818x/rtl8187/dev.c | 3 -
drivers/pinctrl/sh-pfc/pfc-sh7264.c | 9 ++-
drivers/pinctrl/sh-pfc/pfc-sh7734.c | 16 +++---
drivers/platform/x86/hp-wmi.c | 6 +-
drivers/power/avs/smartreflex.c | 3 -
drivers/pwm/core.c | 1
drivers/pwm/pwm-samsung.c | 1
drivers/regulator/palmas-regulator.c | 5 +-
drivers/regulator/tps65910-regulator.c | 4 +
drivers/scsi/csiostor/csio_init.c | 2
drivers/scsi/libsas/sas_expander.c | 29 +++++++++++
drivers/scsi/lpfc/lpfc_scsi.c | 18 +++++++
drivers/scsi/qla2xxx/tcm_qla2xxx.c | 48 +++----------------
drivers/scsi/qla2xxx/tcm_qla2xxx.h | 3 -
drivers/staging/rtl8192e/rtl8192e/rtl_core.c | 5 +-
drivers/tty/serial/max310x.c | 7 --
drivers/usb/serial/ftdi_sio.c | 3 +
drivers/usb/serial/ftdi_sio_ids.h | 7 ++
drivers/xen/xen-pciback/pci_stub.c | 3 -
fs/btrfs/delayed-ref.c | 3 -
fs/gfs2/bmap.c | 2
fs/ocfs2/journal.c | 6 --
fs/xfs/xfs_ioctl32.c | 6 ++
fs/xfs/xfs_rtalloc.c | 4 -
include/linux/gpio/consumer.h | 2
include/linux/netdevice.h | 2
include/linux/reset-controller.h | 2
include/net/sock.h | 2
lib/genalloc.c | 5 +-
net/core/neighbour.c | 13 +++--
net/core/net_namespace.c | 3 -
net/core/sock.c | 2
net/decnet/dn_dev.c | 2
net/openvswitch/datapath.c | 17 +++++--
net/sched/sch_mq.c | 2
net/sched/sch_mqprio.c | 3 -
net/sched/sch_multiq.c | 2
net/sched/sch_prio.c | 2
net/tipc/link.c | 2
net/tipc/netlink_compat.c | 8 ++-
net/vmw_vsock/af_vsock.c | 7 ++
scripts/gdb/linux/symbols.py | 3 -
sound/core/compress_offload.c | 2
sound/soc/kirkwood/kirkwood-i2s.c | 8 +--
93 files changed, 492 insertions(+), 270 deletions(-)
Aditya Pakki (1):
net/net_namespace: Check the return value of register_pernet_subsys()
Alexander Shiyan (1):
serial: max310x: Fix tx_empty() callback
Alexander Usyskin (1):
mei: bus: prefix device names on bus with the bus name
Anatoliy Glagolev (1):
scsi: qla2xxx: deadlock by configfs_depend_item
Andy Shevchenko (1):
net: dev: Use unsigned integer as an argument to left-shift
Arnd Bergmann (1):
ARM: ks8695: fix section mismatch warning
Bart Van Assche (1):
RDMA/srp: Propagate ib_post_send() failures to the SCSI mid-layer
Benjamin Herrenschmidt (1):
powerpc/44x/bamboo: Fix PCI range
Bert Kenward (1):
sfc: initialise found bitmap in efx_ef10_mtd_probe
Bob Peterson (1):
gfs2: take jdata unstuff into account in do_grow
Boris Brezillon (2):
mtd: Check add_mtd_device() ret code
mtd: Remove a debug trace in mtdpart.c
Brian Norris (1):
mwifiex: debugfs: correct histogram spacing, formatting
Candle Sun (1):
HID: core: check whether Usage Page item is after Usage ID items
Christophe Leroy (4):
powerpc/book3s/32: fix number of bats in p/v_block_mapped()
powerpc/xmon: fix dump_segments()
powerpc/prom: fix early DEBUG messages
powerpc/mm: Make NULL pointer deferences explicit on bad page faults.
Dan Carpenter (2):
block: drbd: remove a stray unlock in __drbd_send_protocol()
IB/qib: Fix an error code in qib_sdma_verbs_send()
Darrick J. Wong (1):
xfs: require both realtime inodes to mount
Dust Li (1):
net: sched: fix `tc -s class show` no bstats on class with nolock subqueues
Edward Cree (1):
sfc: suppress duplicate nvmem partition types in efx_ef10_mtd_probe
Eric Biggers (1):
crypto: user - support incremental algorithm dumps
Eric Dumazet (1):
net: fix possible overflow in __sk_mem_raise_allocated()
Eugen Hristev (1):
media: v4l2-ctrl: fix flags for DO_WHITE_BALANCE
Fabio D'Urso (1):
USB: serial: ftdi_sio: add device IDs for U-Blox C099-F9P
Fabio Estevam (1):
ARM: dts: imx53-voipac-dmm-668: Fix memory node duplication
Geert Uytterhoeven (3):
pinctrl: sh-pfc: sh7264: Fix PFCR3 and PFCR0 register configuration
pinctrl: sh-pfc: sh7734: Fix shifted values in IPSR10
openrisc: Fix broken paths to arch/or32
Gen Zhang (1):
powerpc/pseries/dlpar: Fix a missing check in dlpar_parse_cc_property()
Greg Kroah-Hartman (1):
Linux 4.4.206
Gustavo A. R. Silva (1):
tipc: fix memory leak in tipc_nl_compat_publ_dump
Hans de Goede (2):
ACPI / LPSS: Ignore acpi_device_fix_up_power() return value
platform/x86: hp-wmi: Fix ACPI errors caused by too small buffer
Helge Deller (2):
parisc: Fix serio address output
parisc: Fix HP SDC hpa address output
Hoang Le (1):
tipc: fix skb may be leaky in tipc_link_input
Huang Shijie (1):
lib/genalloc.c: use vzalloc_node() to allocate the bitmap
Ilya Leoshkevich (1):
scripts/gdb: fix debugging modules compiled with hot/cold partitioning
James Morse (1):
ACPI / APEI: Switch estatus pool to use vmalloc memory
James Smart (1):
scsi: lpfc: Fix dif and first burst use in write commands
Jeroen Hofstee (2):
can: peak_usb: report bus recovery as well
can: c_can: D_CAN: c_can_chip_config(): perform a sofware reset on open
Johannes Berg (1):
decnet: fix DN_IFREQ_SIZE
John Garry (2):
scsi: libsas: Support SATA PHY connection rate unmatch fixing during discovery
scsi: libsas: Check SMP PHY control function result
John Rutherford (1):
tipc: fix link name length check
Josef Bacik (1):
btrfs: only track ref_heads in delayed_ref_updates
Jouni Hogander (1):
slip: Fix use-after-free Read in slip_open
Junxiao Bi (1):
ocfs2: clear journal dirty flag after shutdown journal
Kangjie Lu (5):
drivers/regulator: fix a missing check of return value
regulator: tps65910: fix a missing check of return value
net: stmicro: fix a missing check of clk_prepare
atl1e: checking the status of atl1e_write_phy_reg
tipc: fix a missing check of genlmsg_put
Konstantin Khlebnikov (2):
net/core/neighbour: tell kmemleak about hash tables
net/core/neighbour: fix kmemleak minimal reference count for hash tables
Krzysztof Kozlowski (1):
gpiolib: Fix return value of gpio_to_desc() stub if !GPIOLIB
Kyle Roeschley (2):
ath6kl: Only use match sets when firmware supports it
ath6kl: Fix off by one error in scan completion
Lars Ellenberg (1):
drbd: reject attach of unsuitable uuids even if connected
Lepton Wu (1):
VSOCK: bind to random port for VMADDR_PORT_ANY
Lionel Debieve (1):
hwrng: stm32 - fix unbalanced pm_runtime_enable
Luc Van Oostenryck (1):
drbd: fix print_st_err()'s prototype to match the definition
Luca Ceresoli (1):
net: macb: fix error format in dev_err()
Marek Szyprowski (1):
clk: samsung: exynos5420: Preserve PLL configuration during suspend/resume
Masahiro Yamada (2):
microblaze: adjust the help to the real behavior
microblaze: move "... is ready" messages to arch/microblaze/Makefile
Menglong Dong (1):
macvlan: schedule bc_work even if error
Michael Mueller (1):
KVM: s390: unregister debug feature on failing arch init
Nick Bowler (1):
xfs: Align compat attrlist_by_handle with native implementation.
Olof Johansson (1):
lib/genalloc.c: include vmalloc.h
Pan Bian (5):
mwifiex: fix potential NULL dereference and use after free
rtl818x: fix potential use after free
ubi: Put MTD device after it is not used
ubi: Do not drop UBI device reference before using
staging: rtl8192e: fix potential use after free
Paolo Abeni (3):
openvswitch: fix flow command message size
openvswitch: drop unneeded BUG_ON() in ovs_flow_cmd_build_info()
openvswitch: remove another BUG_ON()
Peter Hutterer (1):
HID: doc: fix wrong data structure reference for UHID_OUTPUT
Randy Dunlap (1):
reset: fix reset_control_ops kerneldoc comment
Richard Weinberger (1):
um: Make GCOV depend on !KCOV
Ross Lagerwall (1):
xen/pciback: Check dev_data before using it
Russell King (1):
ASoC: kirkwood: fix external clock probe defer
Suzuki K Poulose (1):
arm64: smp: Handle errors reported by the firmware
Thomas Meyer (1):
PM / AVS: SmartReflex: NULL check before some freeing functions is not needed
Uwe Kleine-König (2):
ARM: debug-imx: only define DEBUG_IMX_UART_PORT if needed
pwm: Clear chip_data in pwm_put()
Varun Prakash (1):
scsi: csiostor: fix incorrect dma device in case of vport
Xiaojun Sang (1):
ASoC: compress: fix unsigned integer overflow check