The patch titled
Subject: radix tree: fix multi-order iteration race
has been added to the -mm tree. Its filename is
radix-tree-fix-multi-order-iteration-race.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/radix-tree-fix-multi-order-iterati…
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/radix-tree-fix-multi-order-iterati…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Ross Zwisler <ross.zwisler(a)linux.intel.com>
Subject: radix tree: fix multi-order iteration race
Fix a race in the multi-order iteration code which causes the kernel to
hit a GP fault. This was first seen with a production v4.15 based kernel
(4.15.6-300.fc27.x86_64) utilizing a DAX workload which used order 9 PMD
DAX entries.
The race has to do with how we tear down multi-order sibling entries when
we are removing an item from the tree. Remember for example that an order
2 entry looks like this:
struct radix_tree_node.slots[] = [entry][sibling][sibling][sibling]
where 'entry' is in some slot in the struct radix_tree_node, and the three
slots following 'entry' contain sibling pointers which point back to
'entry.'
When we delete 'entry' from the tree, we call :
radix_tree_delete()
radix_tree_delete_item()
__radix_tree_delete()
replace_slot()
replace_slot() first removes the siblings in order from the first to the
last, then at then replaces 'entry' with NULL. This means that for a
brief period of time we end up with one or more of the siblings removed,
so:
struct radix_tree_node.slots[] = [entry][NULL][sibling][sibling]
This causes an issue if you have a reader iterating over the slots in the
tree via radix_tree_for_each_slot() while only under
rcu_read_lock()/rcu_read_unlock() protection. This is a common case in
mm/filemap.c.
The issue is that when __radix_tree_next_slot() => skip_siblings() tries
to skip over the sibling entries in the slots, it currently does so with
an exact match on the slot directly preceding our current slot. Normally
this works:
V preceding slot
struct radix_tree_node.slots[] = [entry][sibling][sibling][sibling]
^ current slot
This lets you find the first sibling, and you skip them all in order.
But in the case where one of the siblings is NULL, that slot is skipped
and then our sibling detection is interrupted:
V preceding slot
struct radix_tree_node.slots[] = [entry][NULL][sibling][sibling]
^ current slot
This means that the sibling pointers aren't recognized since they point
all the way back to 'entry', so we think that they are normal internal
radix tree pointers. This causes us to think we need to walk down to a
struct radix_tree_node starting at the address of 'entry'.
In a real running kernel this will crash the thread with a GP fault when
you try and dereference the slots in your broken node starting at 'entry'.
We fix this race by fixing the way that skip_siblings() detects sibling
nodes. Instead of testing against the preceding slot we instead look for
siblings via is_sibling_entry() which compares against the position of the
struct radix_tree_node.slots[] array. This ensures that sibling entries
are properly identified, even if they are no longer contiguous with the
'entry' they point to.
Link: http://lkml.kernel.org/r/20180503192430.7582-6-ross.zwisler@linux.intel.com
Fixes: 148deab223b2 ("radix-tree: improve multiorder iterators")
Signed-off-by: Ross Zwisler <ross.zwisler(a)linux.intel.com>
Reported-by: CR, Sapthagirish <sapthagirish.cr(a)intel.com>
Reviewed-by: Jan Kara <jack(a)suse.cz>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Christoph Hellwig <hch(a)lst.de>
Cc: Dan Williams <dan.j.williams(a)intel.com>
Cc: Dave Chinner <david(a)fromorbit.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
lib/radix-tree.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
diff -puN lib/radix-tree.c~radix-tree-fix-multi-order-iteration-race lib/radix-tree.c
--- a/lib/radix-tree.c~radix-tree-fix-multi-order-iteration-race
+++ a/lib/radix-tree.c
@@ -1612,11 +1612,9 @@ static void set_iter_tags(struct radix_t
static void __rcu **skip_siblings(struct radix_tree_node **nodep,
void __rcu **slot, struct radix_tree_iter *iter)
{
- void *sib = node_to_entry(slot - 1);
-
while (iter->index < iter->next_index) {
*nodep = rcu_dereference_raw(*slot);
- if (*nodep && *nodep != sib)
+ if (*nodep && !is_sibling_entry(iter->node, *nodep))
return slot;
slot++;
iter->index = __radix_tree_iter_add(iter, 1);
@@ -1631,7 +1629,7 @@ void __rcu **__radix_tree_next_slot(void
struct radix_tree_iter *iter, unsigned flags)
{
unsigned tag = flags & RADIX_TREE_ITER_TAG_MASK;
- struct radix_tree_node *node = rcu_dereference_raw(*slot);
+ struct radix_tree_node *node;
slot = skip_siblings(&node, slot, iter);
_
Patches currently in -mm which might be from ross.zwisler(a)linux.intel.com are
radix-tree-test-suite-fix-mapshift-build-target.patch
radix-tree-test-suite-fix-compilation-issue.patch
radix-tree-test-suite-add-item_delete_rcu.patch
radix-tree-test-suite-multi-order-iteration-race.patch
radix-tree-fix-multi-order-iteration-race.patch
Hi Greg,
Could you please backport the below commit to 4.9 stable tree. With a CPU bounded workqueue we see much lower cpu utilization and the same IOPs performance.
commit b7363e67b23e04c23c2a99437feefac7292a88bc
Author: Sagi Grimberg <sagi(a)grimberg.me>
Date: Wed Mar 8 22:03:17 2017 +0200
IB/device: Convert ib-comp-wq to be CPU-bound
Thanks,
Raju
From: Lukas Wunner <lukas(a)wunner.de>
When sending packets as fast as possible using "cangen -g 0 -i -x", the
HI-3110 occasionally latches the interrupt pin high on completion of a
packet, but doesn't set the TXCPLT bit in the INTF register. The INTF
register contains 0x00 as if no interrupt has occurred. Even waiting
for a few milliseconds after the interrupt doesn't help.
Work around this apparent erratum by instead checking the TXMTY bit in
the STATF register ("TX FIFO empty"). We know that we've queued up a
packet for transmission if priv->tx_len is nonzero. If the TX FIFO is
empty, transmission of that packet must have completed.
Note that this is congruent with our handling of received packets, which
likewise gleans from the STATF register whether a packet is waiting in
the RX FIFO, instead of looking at the INTF register.
Cc: Mathias Duckeck <m.duckeck(a)kunbus.de>
Cc: Akshay Bhat <akshay.bhat(a)timesys.com>
Cc: Casey Fitzpatrick <casey.fitzpatrick(a)timesys.com>
Cc: stable(a)vger.kernel.org # v4.12+
Signed-off-by: Lukas Wunner <lukas(a)wunner.de>
Acked-by: Akshay Bhat <akshay.bhat(a)timesys.com>
Signed-off-by: Marc Kleine-Budde <mkl(a)pengutronix.de>
---
drivers/net/can/spi/hi311x.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/drivers/net/can/spi/hi311x.c b/drivers/net/can/spi/hi311x.c
index c2cf254e4e95..53e320c92a8b 100644
--- a/drivers/net/can/spi/hi311x.c
+++ b/drivers/net/can/spi/hi311x.c
@@ -91,6 +91,7 @@
#define HI3110_STAT_BUSOFF BIT(2)
#define HI3110_STAT_ERRP BIT(3)
#define HI3110_STAT_ERRW BIT(4)
+#define HI3110_STAT_TXMTY BIT(7)
#define HI3110_BTR0_SJW_SHIFT 6
#define HI3110_BTR0_BRP_SHIFT 0
@@ -737,10 +738,7 @@ static irqreturn_t hi3110_can_ist(int irq, void *dev_id)
}
}
- if (intf == 0)
- break;
-
- if (intf & HI3110_INT_TXCPLT) {
+ if (priv->tx_len && statf & HI3110_STAT_TXMTY) {
net->stats.tx_packets++;
net->stats.tx_bytes += priv->tx_len - 1;
can_led_event(net, CAN_LED_EVENT_TX);
@@ -750,6 +748,9 @@ static irqreturn_t hi3110_can_ist(int irq, void *dev_id)
}
netif_wake_queue(net);
}
+
+ if (intf == 0)
+ break;
}
mutex_unlock(&priv->hi3110_lock);
return IRQ_HANDLED;
--
2.17.0
From: Lukas Wunner <lukas(a)wunner.de>
hi3110_get_berr_counter() may run concurrently to the rest of the driver
but neglects to acquire the lock protecting access to the SPI device.
As a result, it and the rest of the driver may clobber each other's tx
and rx buffers.
We became aware of this issue because transmission of packets with
"cangen -g 0 -i -x" frequently hung. It turns out that agetty executes
->do_get_berr_counter every few seconds via the following call stack:
CPU: 2 PID: 1605 Comm: agetty
[<7f3f7500>] (hi3110_get_berr_counter [hi311x])
[<7f130204>] (can_fill_info [can_dev])
[<80693bc0>] (rtnl_fill_ifinfo)
[<806949ec>] (rtnl_dump_ifinfo)
[<806b4834>] (netlink_dump)
[<806b4bc8>] (netlink_recvmsg)
[<8065f180>] (sock_recvmsg)
[<80660f90>] (___sys_recvmsg)
[<80661e7c>] (__sys_recvmsg)
[<80661ec0>] (SyS_recvmsg)
[<80108b20>] (ret_fast_syscall+0x0/0x1c)
agetty listens to netlink messages in order to update the login prompt
when IP addresses change (if /etc/issue contains \4 or \6 escape codes):
https://git.kernel.org/pub/scm/utils/util-linux/util-linux.git/commit/?id=e…
It's a useful feature, though it seems questionable that it causes CAN
bit error statistics to be queried.
Be that as it may, if hi3110_get_berr_counter() is invoked while a frame
is sent by hi3110_hw_tx(), bogus SPI transfers like the following may
occur:
=> 12 00 (hi3110_get_berr_counter() wanted to transmit
EC 00 to query the transmit error counter,
but the first byte was overwritten by
hi3110_hw_tx_frame())
=> EA 00 3E 80 01 FB (hi3110_hw_tx_frame() wanted to transmit a
frame, but the first byte was overwritten by
hi3110_get_berr_counter() because it wanted
to query the receive error counter)
This sequence hangs the transmission because the driver believes it has
sent a frame and waits for the interrupt signaling completion, but in
reality the chip has never sent away the frame since the commands it
received were malformed.
Fix by acquiring the SPI lock in hi3110_get_berr_counter().
I've scrutinized the entire driver for further unlocked SPI accesses but
found no others.
Cc: Mathias Duckeck <m.duckeck(a)kunbus.de>
Cc: Akshay Bhat <akshay.bhat(a)timesys.com>
Cc: Casey Fitzpatrick <casey.fitzpatrick(a)timesys.com>
Cc: Stef Walter <stefw(a)redhat.com>
Cc: Karel Zak <kzak(a)redhat.com>
Cc: stable(a)vger.kernel.org # v4.12+
Signed-off-by: Lukas Wunner <lukas(a)wunner.de>
Reviewed-by: Akshay Bhat <akshay.bhat(a)timesys.com>
Signed-off-by: Marc Kleine-Budde <mkl(a)pengutronix.de>
---
drivers/net/can/spi/hi311x.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/net/can/spi/hi311x.c b/drivers/net/can/spi/hi311x.c
index 5590c559a8ca..c2cf254e4e95 100644
--- a/drivers/net/can/spi/hi311x.c
+++ b/drivers/net/can/spi/hi311x.c
@@ -427,8 +427,10 @@ static int hi3110_get_berr_counter(const struct net_device *net,
struct hi3110_priv *priv = netdev_priv(net);
struct spi_device *spi = priv->spi;
+ mutex_lock(&priv->hi3110_lock);
bec->txerr = hi3110_read(spi, HI3110_READ_TEC);
bec->rxerr = hi3110_read(spi, HI3110_READ_REC);
+ mutex_unlock(&priv->hi3110_lock);
return 0;
}
--
2.17.0
It looks like the NAND_STATUS_FAIL bit is sticky after an ECC failure,
which leads all READ operations following the failing one to report
an ECC failure. Reset the chip to clear the NAND_STATUS_FAIL bit.
Note that this behavior is not document in the datasheet, but resetting
the chip is the only solution we found to fix the problem.
Fixes: 9748e1d87573 ("mtd: nand: add support for Micron on-die ECC")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Boris Brezillon <boris.brezillon(a)bootlin.com>
Cc: Thomas Petazzoni <thomas.petazzoni(a)bootlin.com>
Cc: Bean Huo <beanhuo(a)micron.com>
Cc: Peter Pan <peterpandong(a)micron.com>
---
Peter, Bean,
Can you confirm this behavior, or ask someone in Micron who can confirm
it? Also, if a RESET is actually needed, it would be good to update the
datasheet accordingly. And if that's not the case, can you explain why
the NAND_STATUS_FAIL bit is stuck and how to clear it (I tried a 0x00
command, A.K.A. READ STATUS EXIT, but it does not clear this bit, ERASE
and PROGRAM seem to clear the bit, but that's clearly not the kind of
operation I can do when the user asks for a READ)?
Thanks,
Boris
---
drivers/mtd/nand/raw/nand_micron.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
diff --git a/drivers/mtd/nand/raw/nand_micron.c b/drivers/mtd/nand/raw/nand_micron.c
index 0af45b134c0c..a915f568f6a3 100644
--- a/drivers/mtd/nand/raw/nand_micron.c
+++ b/drivers/mtd/nand/raw/nand_micron.c
@@ -153,6 +153,23 @@ micron_nand_read_page_on_die_ecc(struct mtd_info *mtd, struct nand_chip *chip,
ret = nand_read_data_op(chip, chip->oob_poi, mtd->oobsize,
false);
+ /*
+ * Looks like the NAND_STATUS_FAIL bit is sticky after an ECC failure,
+ * which leads all READ operations following the failing one to report
+ * an ECC failure.
+ * Reset the chip to clear it.
+ *
+ * Note that this behavior is not document in the datasheet, but
+ * resetting the chip is the only solution we found to clear this bit.
+ */
+ if (status & NAND_STATUS_FAIL) {
+ int cs = page >> (chip->chip_shift - chip->page_shift);
+
+ chip->select_chip(mtd, -1);
+ nand_reset(chip, cs);
+ chip->select_chip(mtd, cs);
+ }
+
out:
micron_nand_on_die_ecc_setup(chip, false);
--
2.14.1
The patch titled
Subject: lib/test_bitmap.c: fix bitmap optimisation tests to report errors correctly
has been added to the -mm tree. Its filename is
lib-test_bitmapc-fix-bitmap-optimisation-tests-to-report-errors-correctly.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/lib-test_bitmapc-fix-bitmap-optimi…
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/lib-test_bitmapc-fix-bitmap-optimi…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Matthew Wilcox <mawilcox(a)microsoft.com>
Subject: lib/test_bitmap.c: fix bitmap optimisation tests to report errors correctly
I had neglected to increment the error counter when the tests failed,
which made the tests noisy when they fail, but not actually return an
error code.
Link: http://lkml.kernel.org/r/20180509114328.9887-1-mpe@ellerman.id.au
Fixes: 3cc78125a081 ("lib/test_bitmap.c: add optimisation tests")
Signed-off-by: Matthew Wilcox <mawilcox(a)microsoft.com>
Signed-off-by: Michael Ellerman <mpe(a)ellerman.id.au>
Reported-by: Michael Ellerman <mpe(a)ellerman.id.au>
Tested-by: Michael Ellerman <mpe(a)ellerman.id.au>
Reviewed-by: Kees Cook <keescook(a)chromium.org>
Cc: Yury Norov <ynorov(a)caviumnetworks.com>
Cc: Andy Shevchenko <andriy.shevchenko(a)linux.intel.com>
Cc: Geert Uytterhoeven <geert(a)linux-m68k.org>
Cc: <stable(a)vger.kernel.org> [4.13+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
lib/test_bitmap.c | 21 +++++++++++++++------
1 file changed, 15 insertions(+), 6 deletions(-)
diff -puN lib/test_bitmap.c~lib-test_bitmapc-fix-bitmap-optimisation-tests-to-report-errors-correctly lib/test_bitmap.c
--- a/lib/test_bitmap.c~lib-test_bitmapc-fix-bitmap-optimisation-tests-to-report-errors-correctly
+++ a/lib/test_bitmap.c
@@ -331,23 +331,32 @@ static void noinline __init test_mem_opt
unsigned int start, nbits;
for (start = 0; start < 1024; start += 8) {
- memset(bmap1, 0x5a, sizeof(bmap1));
- memset(bmap2, 0x5a, sizeof(bmap2));
for (nbits = 0; nbits < 1024 - start; nbits += 8) {
+ memset(bmap1, 0x5a, sizeof(bmap1));
+ memset(bmap2, 0x5a, sizeof(bmap2));
+
bitmap_set(bmap1, start, nbits);
__bitmap_set(bmap2, start, nbits);
- if (!bitmap_equal(bmap1, bmap2, 1024))
+ if (!bitmap_equal(bmap1, bmap2, 1024)) {
printk("set not equal %d %d\n", start, nbits);
- if (!__bitmap_equal(bmap1, bmap2, 1024))
+ failed_tests++;
+ }
+ if (!__bitmap_equal(bmap1, bmap2, 1024)) {
printk("set not __equal %d %d\n", start, nbits);
+ failed_tests++;
+ }
bitmap_clear(bmap1, start, nbits);
__bitmap_clear(bmap2, start, nbits);
- if (!bitmap_equal(bmap1, bmap2, 1024))
+ if (!bitmap_equal(bmap1, bmap2, 1024)) {
printk("clear not equal %d %d\n", start, nbits);
- if (!__bitmap_equal(bmap1, bmap2, 1024))
+ failed_tests++;
+ }
+ if (!__bitmap_equal(bmap1, bmap2, 1024)) {
printk("clear not __equal %d %d\n", start,
nbits);
+ failed_tests++;
+ }
}
}
}
_
Patches currently in -mm which might be from mawilcox(a)microsoft.com are
lib-test_bitmapc-fix-bitmap-optimisation-tests-to-report-errors-correctly.patch
slab-__gfp_zero-is-incompatible-with-a-constructor.patch
ida-remove-simple_ida_lock.patch
From: Dave Hansen <dave.hansen(a)linux.intel.com>
mm_pkey_is_allocated() treats pkey 0 as unallocated. That is
inconsistent with the manpages, and also inconsistent with
mm->context.pkey_allocation_map. Stop special casing it and only
disallow values that are actually bad (< 0).
The end-user visible effect of this is that you can now use
mprotect_pkey() to set pkey=0.
This is a bit nicer than what Ram proposed[1] because it is simpler
and removes special-casing for pkey 0. On the other hand, it does
allow applications to pkey_free() pkey-0, but that's just a silly
thing to do, so we are not going to protect against it.
The scenario that could happen is similar to what happens if you free
any other pkey that is in use: it might get reallocated later and used
to protect some other data. The most likely scenario is that pkey-0
comes back from pkey_alloc(), an access-disable or write-disable bit
is set in PKRU for it, and the next stack access will SIGSEGV. It's
not horribly different from if you mprotect()'d your stack or heap to
be unreadable or unwritable, which is generally very foolish, but also
not explicitly prevented by the kernel.
1. http://lkml.kernel.org/r/1522112702-27853-1-git-send-email-linuxram@us.ibm.…
Signed-off-by: Dave Hansen <dave.hansen(a)linux.intel.com>
Fixes: 58ab9a088dda ("x86/pkeys: Check against max pkey to avoid overflows")
Cc: stable(a)vger.kernel.org
Cc: Ram Pai <linuxram(a)us.ibm.com>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Dave Hansen <dave.hansen(a)intel.com>
Cc: Michael Ellermen <mpe(a)ellerman.id.au>
Cc: Ingo Molnar <mingo(a)kernel.org>
Cc: Andrew Morton <akpm(a)linux-foundation.org>p
Cc: Shuah Khan <shuah(a)kernel.org>
---
b/arch/x86/include/asm/mmu_context.h | 2 +-
b/arch/x86/include/asm/pkeys.h | 6 +++---
2 files changed, 4 insertions(+), 4 deletions(-)
diff -puN arch/x86/include/asm/mmu_context.h~x86-pkey-0-default-allocated arch/x86/include/asm/mmu_context.h
--- a/arch/x86/include/asm/mmu_context.h~x86-pkey-0-default-allocated 2018-05-09 09:20:24.362698393 -0700
+++ b/arch/x86/include/asm/mmu_context.h 2018-05-09 09:20:24.367698393 -0700
@@ -193,7 +193,7 @@ static inline int init_new_context(struc
#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
if (cpu_feature_enabled(X86_FEATURE_OSPKE)) {
- /* pkey 0 is the default and always allocated */
+ /* pkey 0 is the default and allocated implicitly */
mm->context.pkey_allocation_map = 0x1;
/* -1 means unallocated or invalid */
mm->context.execute_only_pkey = -1;
diff -puN arch/x86/include/asm/pkeys.h~x86-pkey-0-default-allocated arch/x86/include/asm/pkeys.h
--- a/arch/x86/include/asm/pkeys.h~x86-pkey-0-default-allocated 2018-05-09 09:20:24.364698393 -0700
+++ b/arch/x86/include/asm/pkeys.h 2018-05-09 09:20:24.367698393 -0700
@@ -51,10 +51,10 @@ bool mm_pkey_is_allocated(struct mm_stru
{
/*
* "Allocated" pkeys are those that have been returned
- * from pkey_alloc(). pkey 0 is special, and never
- * returned from pkey_alloc().
+ * from pkey_alloc() or pkey 0 which is allocated
+ * implicitly when the mm is created.
*/
- if (pkey <= 0)
+ if (pkey < 0)
return false;
if (pkey >= arch_max_pkey())
return false;
_
From: Dave Hansen <dave.hansen(a)linux.intel.com>
I got a bug report that the following code (roughly) was
causing a SIGSEGV:
mprotect(ptr, size, PROT_EXEC);
mprotect(ptr, size, PROT_NONE);
mprotect(ptr, size, PROT_READ);
*ptr = 100;
The problem is hit when the mprotect(PROT_EXEC)
is implicitly assigned a protection key to the VMA, and made
that key ACCESS_DENY|WRITE_DENY. The PROT_NONE mprotect()
failed to remove the protection key, and the PROT_NONE->
PROT_READ left the PTE usable, but the pkey still in place
and left the memory inaccessible.
To fix this, we ensure that we always "override" the pkee
at mprotect() if the VMA does not have execute-only
permissions, but the VMA has the execute-only pkey.
We had a check for PROT_READ/WRITE, but it did not work
for PROT_NONE. This entirely removes the PROT_* checks,
which ensures that PROT_NONE now works.
Signed-off-by: Dave Hansen <dave.hansen(a)linux.intel.com>
Fixes: 62b5f7d013f ("mm/core, x86/mm/pkeys: Add execute-only protection keys support")
Reported-by: Shakeel Butt <shakeelb(a)google.com>
Cc: stable(a)vger.kernel.org
Cc: Ram Pai <linuxram(a)us.ibm.com>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Dave Hansen <dave.hansen(a)intel.com>
Cc: Michael Ellermen <mpe(a)ellerman.id.au>
Cc: Ingo Molnar <mingo(a)kernel.org>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Shuah Khan <shuah(a)kernel.org>
---
b/arch/x86/include/asm/pkeys.h | 12 +++++++++++-
b/arch/x86/mm/pkeys.c | 21 +++++++++++----------
2 files changed, 22 insertions(+), 11 deletions(-)
diff -puN arch/x86/include/asm/pkeys.h~pkeys-abandon-exec-only-pkey-more-aggressively arch/x86/include/asm/pkeys.h
--- a/arch/x86/include/asm/pkeys.h~pkeys-abandon-exec-only-pkey-more-aggressively 2018-05-09 09:20:22.295698398 -0700
+++ b/arch/x86/include/asm/pkeys.h 2018-05-09 09:20:22.300698398 -0700
@@ -2,6 +2,8 @@
#ifndef _ASM_X86_PKEYS_H
#define _ASM_X86_PKEYS_H
+#define ARCH_DEFAULT_PKEY 0
+
#define arch_max_pkey() (boot_cpu_has(X86_FEATURE_OSPKE) ? 16 : 1)
extern int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
@@ -15,7 +17,7 @@ extern int __execute_only_pkey(struct mm
static inline int execute_only_pkey(struct mm_struct *mm)
{
if (!boot_cpu_has(X86_FEATURE_OSPKE))
- return 0;
+ return ARCH_DEFAULT_PKEY;
return __execute_only_pkey(mm);
}
@@ -56,6 +58,14 @@ bool mm_pkey_is_allocated(struct mm_stru
return false;
if (pkey >= arch_max_pkey())
return false;
+ /*
+ * The exec-only pkey is set in the allocation map, but
+ * is not available to any of the user interfaces like
+ * mprotect_pkey().
+ */
+ if (pkey == mm->context.execute_only_pkey)
+ return false;
+
return mm_pkey_allocation_map(mm) & (1U << pkey);
}
diff -puN arch/x86/mm/pkeys.c~pkeys-abandon-exec-only-pkey-more-aggressively arch/x86/mm/pkeys.c
--- a/arch/x86/mm/pkeys.c~pkeys-abandon-exec-only-pkey-more-aggressively 2018-05-09 09:20:22.297698398 -0700
+++ b/arch/x86/mm/pkeys.c 2018-05-09 09:20:22.301698398 -0700
@@ -94,26 +94,27 @@ int __arch_override_mprotect_pkey(struct
*/
if (pkey != -1)
return pkey;
- /*
- * Look for a protection-key-drive execute-only mapping
- * which is now being given permissions that are not
- * execute-only. Move it back to the default pkey.
- */
- if (vma_is_pkey_exec_only(vma) &&
- (prot & (PROT_READ|PROT_WRITE))) {
- return 0;
- }
+
/*
* The mapping is execute-only. Go try to get the
* execute-only protection key. If we fail to do that,
* fall through as if we do not have execute-only
- * support.
+ * support in this mm.
*/
if (prot == PROT_EXEC) {
pkey = execute_only_pkey(vma->vm_mm);
if (pkey > 0)
return pkey;
+ } else if (vma_is_pkey_exec_only(vma)) {
+ /*
+ * Protections are *not* PROT_EXEC, but the mapping
+ * is using the exec-only pkey. This mapping was
+ * PROT_EXEC and will no longer be. Move back to
+ * the default pkey.
+ */
+ return ARCH_DEFAULT_PKEY;
}
+
/*
* This is a vanilla, non-pkey mprotect (or we failed to
* setup execute-only), inherit the pkey from the VMA we
_