This is a partial revert of commit 8b3517f88ff2 ("PCI:
loongson: Prevent LS7A MRRS increases") for MIPS-based Loongson.
There are many MIPS-based Loongson systems in the wild that
shipped with firmware which does not set the maximum MRRS properly.
Limit MRRS to 256 for all of them, as MIPS Loongson systems that
support a higher MRRS are considered rare.
This must be done at the device enablement stage because the MRRS
setting may be lost if the parent bridge loses PCI_COMMAND_MASTER,
and only at this point can we be sure the parent bridge is enabled.
Cc: stable(a)vger.kernel.org
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217680
Fixes: 8b3517f88ff2 ("PCI: loongson: Prevent LS7A MRRS increases")
Signed-off-by: Jiaxun Yang <jiaxun.yang(a)flygoat.com>
---
v4: Improve commit message
v5:
- Improve commit message and comments.
- Style fix from Huacai's off-list input.
v6: Fix a typo
---
drivers/pci/controller/pci-loongson.c | 47 ++++++++++++++++++++++++---
1 file changed, 42 insertions(+), 5 deletions(-)
diff --git a/drivers/pci/controller/pci-loongson.c b/drivers/pci/controller/pci-loongson.c
index d45e7b8dc530..e181d99decf1 100644
--- a/drivers/pci/controller/pci-loongson.c
+++ b/drivers/pci/controller/pci-loongson.c
@@ -80,13 +80,50 @@ DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
DEV_LS7A_LPC, system_bus_quirk);
+/*
+ * Some Loongson PCIe ports have h/w limitations of maximum read
+ * request size. They can't handle anything larger than this.
+ * Sane firmware will set proper MRRS at boot, so we only need
+ * no_inc_mrrs for bridges. However, some MIPS Loongson firmware
+ * won't set MRRS properly, and we have to enforce maximum safe
+ * MRRS, which is 256 bytes.
+ */
+#ifdef CONFIG_MIPS
+static void loongson_set_min_mrrs_quirk(struct pci_dev *pdev)
+{
+ struct pci_bus *bus = pdev->bus;
+ struct pci_dev *bridge;
+ static const struct pci_device_id bridge_devids[] = {
+ { PCI_VDEVICE(LOONGSON, DEV_LS2K_PCIE_PORT0) },
+ { PCI_VDEVICE(LOONGSON, DEV_LS7A_PCIE_PORT0) },
+ { PCI_VDEVICE(LOONGSON, DEV_LS7A_PCIE_PORT1) },
+ { PCI_VDEVICE(LOONGSON, DEV_LS7A_PCIE_PORT2) },
+ { PCI_VDEVICE(LOONGSON, DEV_LS7A_PCIE_PORT3) },
+ { PCI_VDEVICE(LOONGSON, DEV_LS7A_PCIE_PORT4) },
+ { PCI_VDEVICE(LOONGSON, DEV_LS7A_PCIE_PORT5) },
+ { PCI_VDEVICE(LOONGSON, DEV_LS7A_PCIE_PORT6) },
+ { 0, },
+ };
+
+ /* look for the matching bridge */
+ while (!pci_is_root_bus(bus)) {
+ bridge = bus->self;
+ bus = bus->parent;
+
+ if (pci_match_id(bridge_devids, bridge)) {
+ if (pcie_get_readrq(pdev) > 256) {
+ pci_info(pdev, "limiting MRRS to 256\n");
+ pcie_set_readrq(pdev, 256);
+ }
+ break;
+ }
+ }
+}
+DECLARE_PCI_FIXUP_ENABLE(PCI_ANY_ID, PCI_ANY_ID, loongson_set_min_mrrs_quirk);
+#endif
+
static void loongson_mrrs_quirk(struct pci_dev *pdev)
{
- /*
- * Some Loongson PCIe ports have h/w limitations of maximum read
- * request size. They can't handle anything larger than this. So
- * force this limit on any devices attached under these ports.
- */
struct pci_host_bridge *bridge = pci_find_host_bridge(pdev->bus);
bridge->no_inc_mrrs = 1;
--
2.34.1
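For readers unfamiliar with how MRRS is encoded, the following is a minimal userspace sketch, not kernel code. It assumes the standard PCIe Device Control register layout, where bits 14:12 hold the Max_Read_Request_Size field and the size in bytes is 128 << field. The helpers devctl_get_readrq(), devctl_set_readrq() and clamp_mrrs() are hypothetical names that only model what pcie_get_readrq() and pcie_set_readrq() do against a real device.

```c
#include <assert.h>
#include <stdint.h>

/* Max_Read_Request_Size field of the PCIe Device Control register. */
#define DEVCTL_READRQ_MASK  0x7000u
#define DEVCTL_READRQ_SHIFT 12

/* Decode MRRS in bytes: 128 << field (hypothetical helper). */
static unsigned int devctl_get_readrq(uint16_t devctl)
{
	return 128u << ((devctl & DEVCTL_READRQ_MASK) >> DEVCTL_READRQ_SHIFT);
}

/* Encode rq (a power of two in 128..4096) into devctl (hypothetical helper). */
static uint16_t devctl_set_readrq(uint16_t devctl, unsigned int rq)
{
	unsigned int field = 0;

	assert(rq >= 128 && rq <= 4096 && (rq & (rq - 1)) == 0);
	while ((128u << field) < rq)
		field++;

	return (uint16_t)((devctl & ~DEVCTL_READRQ_MASK) |
			  (field << DEVCTL_READRQ_SHIFT));
}

/* Lower MRRS to limit if it is currently larger; never raise it. */
static uint16_t clamp_mrrs(uint16_t devctl, unsigned int limit)
{
	if (devctl_get_readrq(devctl) > limit)
		return devctl_set_readrq(devctl, limit);
	return devctl;
}
```

Like the quirk in the patch, clamp_mrrs() only ever lowers the value, so a device that firmware already configured for 128 or 256 bytes is left alone.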
Please merge commit 85c2ceaafbd3 ("mm/damon/sysfs: eliminate potential
uninitialized variable warning") to >=5.19 stable kernels.
On 2023-10-31, I sent[1] a fix for v5.19. A week later, Dan found an issue in
the fix and sent a fix[2] for it. At that time, the commit that Dan was fixing
had been merged in the mm tree but not in the mainline. Hence, Dan didn't Cc
stable@.
However, the broken fix[1] is now merged in the mainline as commit 973233600676
("mm/damon/sysfs: update monitoring target regions for online input commit"),
and in all >=5.19 stable trees. Hence Dan's fix should also be applied to those
trees. Please apply it.
Note that the bug was only a potential one[3] because of an unchecked return
value. However, the unchecked return value was not intentional behavior but a
bug. Hence we further made the return value checked[4]. The return value
check fix is also merged in the relevant stable trees, so the fix is now needed
for a real bug.
[1] https://lore.kernel.org/all/20231031170131.46972-1-sj@kernel.org/
[2] https://lore.kernel.org/all/739e6aaf-a634-4e33-98a8-16546379ec9f@moroto.mou…
[3] https://lore.kernel.org/all/20231106165205.48264-1-sj@kernel.org/
[4] https://lore.kernel.org/all/20231106233408.51159-1-sj@kernel.org/
Thanks,
SJ
The patch titled
Subject: mm/mglru: reclaim offlined memcgs harder
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-mglru-reclaim-offlined-memcgs-harder.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Yu Zhao <yuzhao(a)google.com>
Subject: mm/mglru: reclaim offlined memcgs harder
Date: Thu, 7 Dec 2023 23:14:07 -0700
In the effort to reduce zombie memcgs [1], it was discovered that the
memcg LRU doesn't apply enough pressure on offlined memcgs. Specifically,
instead of rotating them to the tail of the current generation
(MEMCG_LRU_TAIL) for a second attempt, it moves them to the next
generation (MEMCG_LRU_YOUNG) after the first attempt.
Not applying enough pressure on offlined memcgs can cause them to build
up, and this can be particularly harmful to memory-constrained systems.
On Pixel 8 Pro, launching apps for 50 cycles:
                 Before  After  Change
  Zombie memcgs  45      35     -22%
[1] https://lore.kernel.org/CABdmKX2M6koq4Q0Cmp_-=wbP0Qa190HdEGGaHfxNS05gAkUtPA…
Link: https://lkml.kernel.org/r/20231208061407.2125867-4-yuzhao@google.com
Fixes: e4dde56cd208 ("mm: multi-gen LRU: per-node lru_gen_folio lists")
Signed-off-by: Yu Zhao <yuzhao(a)google.com>
Reported-by: T.J. Mercier <tjmercier(a)google.com>
Tested-by: T.J. Mercier <tjmercier(a)google.com>
Cc: Charan Teja Kalla <quic_charante(a)quicinc.com>
Cc: Hillf Danton <hdanton(a)sina.com>
Cc: Jaroslav Pulchart <jaroslav.pulchart(a)gooddata.com>
Cc: Kairui Song <ryncsn(a)gmail.com>
Cc: Kalesh Singh <kaleshsingh(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/mmzone.h | 8 ++++----
mm/vmscan.c | 24 ++++++++++++++++--------
2 files changed, 20 insertions(+), 12 deletions(-)
--- a/include/linux/mmzone.h~mm-mglru-reclaim-offlined-memcgs-harder
+++ a/include/linux/mmzone.h
@@ -519,10 +519,10 @@ void lru_gen_look_around(struct page_vma
* 1. Exceeding the soft limit, which triggers MEMCG_LRU_HEAD;
* 2. The first attempt to reclaim a memcg below low, which triggers
* MEMCG_LRU_TAIL;
- * 3. The first attempt to reclaim a memcg below reclaimable size threshold,
- * which triggers MEMCG_LRU_TAIL;
- * 4. The second attempt to reclaim a memcg below reclaimable size threshold,
- * which triggers MEMCG_LRU_YOUNG;
+ * 3. The first attempt to reclaim a memcg offlined or below reclaimable size
+ * threshold, which triggers MEMCG_LRU_TAIL;
+ * 4. The second attempt to reclaim a memcg offlined or below reclaimable size
+ * threshold, which triggers MEMCG_LRU_YOUNG;
* 5. Attempting to reclaim a memcg below min, which triggers MEMCG_LRU_YOUNG;
* 6. Finishing the aging on the eviction path, which triggers MEMCG_LRU_YOUNG;
* 7. Offlining a memcg, which triggers MEMCG_LRU_OLD.
--- a/mm/vmscan.c~mm-mglru-reclaim-offlined-memcgs-harder
+++ a/mm/vmscan.c
@@ -4598,7 +4598,12 @@ static bool should_run_aging(struct lruv
}
/* try to scrape all its memory if this memcg was deleted */
- *nr_to_scan = mem_cgroup_online(memcg) ? (total >> sc->priority) : total;
+ if (!mem_cgroup_online(memcg)) {
+ *nr_to_scan = total;
+ return false;
+ }
+
+ *nr_to_scan = total >> sc->priority;
/*
* The aging tries to be lazy to reduce the overhead, while the eviction
@@ -4719,14 +4724,9 @@ static int shrink_one(struct lruvec *lru
bool success;
unsigned long scanned = sc->nr_scanned;
unsigned long reclaimed = sc->nr_reclaimed;
- int seg = lru_gen_memcg_seg(lruvec);
struct mem_cgroup *memcg = lruvec_memcg(lruvec);
struct pglist_data *pgdat = lruvec_pgdat(lruvec);
- /* see the comment on MEMCG_NR_GENS */
- if (!lruvec_is_sizable(lruvec, sc))
- return seg != MEMCG_LRU_TAIL ? MEMCG_LRU_TAIL : MEMCG_LRU_YOUNG;
-
mem_cgroup_calculate_protection(NULL, memcg);
if (mem_cgroup_below_min(NULL, memcg))
@@ -4734,7 +4734,7 @@ static int shrink_one(struct lruvec *lru
if (mem_cgroup_below_low(NULL, memcg)) {
/* see the comment on MEMCG_NR_GENS */
- if (seg != MEMCG_LRU_TAIL)
+ if (lru_gen_memcg_seg(lruvec) != MEMCG_LRU_TAIL)
return MEMCG_LRU_TAIL;
memcg_memory_event(memcg, MEMCG_LOW);
@@ -4750,7 +4750,15 @@ static int shrink_one(struct lruvec *lru
flush_reclaim_state(sc);
- return success ? MEMCG_LRU_YOUNG : 0;
+ if (success && mem_cgroup_online(memcg))
+ return MEMCG_LRU_YOUNG;
+
+ if (!success && lruvec_is_sizable(lruvec, sc))
+ return 0;
+
+ /* one retry if offlined or too small */
+ return lru_gen_memcg_seg(lruvec) != MEMCG_LRU_TAIL ?
+ MEMCG_LRU_TAIL : MEMCG_LRU_YOUNG;
}
#ifdef CONFIG_MEMCG
_
Patches currently in -mm which might be from yuzhao(a)google.com are
mm-mglru-fix-underprotected-page-cache.patch
mm-mglru-try-to-stop-at-high-watermarks.patch
mm-mglru-respect-min_ttl_ms-with-memcgs.patch
mm-mglru-reclaim-offlined-memcgs-harder.patch
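The new tail of shrink_one() in the patch above can be read as a small decision table. The sketch below is a standalone model of that logic under stated assumptions: the enum values and the function next_memcg_lru_op() are hypothetical stand-ins, and the four booleans stand for the result of try_to_shrink_lruvec(), mem_cgroup_online(), lruvec_is_sizable(), and whether lru_gen_memcg_seg() currently equals MEMCG_LRU_TAIL.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the memcg LRU operations named in the patch. */
enum lru_op { OP_NOP = 0, OP_HEAD, OP_TAIL, OP_OLD, OP_YOUNG };

/*
 * Model of the return value at the end of shrink_one():
 *  - an online memcg that was shrunk successfully moves to the young
 *    generation;
 *  - an unsuccessful attempt on a memcg that is still sizable stays put;
 *  - everything else (offlined or too small) gets one retry at the tail
 *    of its current generation before it moves to the young generation.
 */
static enum lru_op next_memcg_lru_op(bool success, bool online,
				     bool sizable, bool seg_is_tail)
{
	if (success && online)
		return OP_YOUNG;

	if (!success && sizable)
		return OP_NOP;

	return seg_is_tail ? OP_YOUNG : OP_TAIL;
}
```

With this model, an offlined memcg is first rotated to the tail (OP_TAIL) and only moves to the young generation on the second attempt, which matches events 3 and 4 in the updated mmzone.h comment.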
The patch titled
Subject: mm/mglru: respect min_ttl_ms with memcgs
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-mglru-respect-min_ttl_ms-with-memcgs.patch
------------------------------------------------------
From: Yu Zhao <yuzhao(a)google.com>
Subject: mm/mglru: respect min_ttl_ms with memcgs
Date: Thu, 7 Dec 2023 23:14:06 -0700
While investigating kswapd "consuming 100% CPU" [1] (also see "mm/mglru:
try to stop at high watermarks"), it was discovered that the memcg LRU can
breach the thrashing protection imposed by min_ttl_ms.
Before the memcg LRU:
  kswapd()
    shrink_node_memcgs()
      mem_cgroup_iter()
        inc_max_seq()  // always hit a different memcg
    lru_gen_age_node()
      mem_cgroup_iter()
        check the timestamp of the oldest generation

After the memcg LRU:
  kswapd()
    shrink_many()
      restart:
        iterate the memcg LRU:
          inc_max_seq()  // occasionally hit the same memcg
        if raced with lru_gen_rotate_memcg():
          goto restart
    lru_gen_age_node()
      mem_cgroup_iter()
        check the timestamp of the oldest generation
Specifically, when the restart happens in shrink_many(), it needs to stick
with the (memcg LRU) generation it began with. In other words, it should
neither re-read memcg_lru->seq nor age an lruvec of a different
generation. Otherwise it can hit the same memcg multiple times without
giving lru_gen_age_node() a chance to check the timestamp of that memcg's
oldest generation (against min_ttl_ms).
[1] https://lore.kernel.org/CAK8fFZ4DY+GtBA40Pm7Nn5xCHy+51w3sfxPqkqpqakSXYyX+Wg…
Link: https://lkml.kernel.org/r/20231208061407.2125867-3-yuzhao@google.com
Fixes: e4dde56cd208 ("mm: multi-gen LRU: per-node lru_gen_folio lists")
Signed-off-by: Yu Zhao <yuzhao(a)google.com>
Tested-by: T.J. Mercier <tjmercier(a)google.com>
Cc: Charan Teja Kalla <quic_charante(a)quicinc.com>
Cc: Hillf Danton <hdanton(a)sina.com>
Cc: Jaroslav Pulchart <jaroslav.pulchart(a)gooddata.com>
Cc: Kairui Song <ryncsn(a)gmail.com>
Cc: Kalesh Singh <kaleshsingh(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/mmzone.h | 30 +++++++++++++++++-------------
mm/vmscan.c | 30 ++++++++++++++++--------------
2 files changed, 33 insertions(+), 27 deletions(-)
--- a/include/linux/mmzone.h~mm-mglru-respect-min_ttl_ms-with-memcgs
+++ a/include/linux/mmzone.h
@@ -505,33 +505,37 @@ void lru_gen_look_around(struct page_vma
* the old generation, is incremented when all its bins become empty.
*
* There are four operations:
- * 1. MEMCG_LRU_HEAD, which moves an memcg to the head of a random bin in its
+ * 1. MEMCG_LRU_HEAD, which moves a memcg to the head of a random bin in its
* current generation (old or young) and updates its "seg" to "head";
- * 2. MEMCG_LRU_TAIL, which moves an memcg to the tail of a random bin in its
+ * 2. MEMCG_LRU_TAIL, which moves a memcg to the tail of a random bin in its
* current generation (old or young) and updates its "seg" to "tail";
- * 3. MEMCG_LRU_OLD, which moves an memcg to the head of a random bin in the old
+ * 3. MEMCG_LRU_OLD, which moves a memcg to the head of a random bin in the old
* generation, updates its "gen" to "old" and resets its "seg" to "default";
- * 4. MEMCG_LRU_YOUNG, which moves an memcg to the tail of a random bin in the
+ * 4. MEMCG_LRU_YOUNG, which moves a memcg to the tail of a random bin in the
* young generation, updates its "gen" to "young" and resets its "seg" to
* "default".
*
* The events that trigger the above operations are:
* 1. Exceeding the soft limit, which triggers MEMCG_LRU_HEAD;
- * 2. The first attempt to reclaim an memcg below low, which triggers
+ * 2. The first attempt to reclaim a memcg below low, which triggers
* MEMCG_LRU_TAIL;
- * 3. The first attempt to reclaim an memcg below reclaimable size threshold,
+ * 3. The first attempt to reclaim a memcg below reclaimable size threshold,
* which triggers MEMCG_LRU_TAIL;
- * 4. The second attempt to reclaim an memcg below reclaimable size threshold,
+ * 4. The second attempt to reclaim a memcg below reclaimable size threshold,
* which triggers MEMCG_LRU_YOUNG;
- * 5. Attempting to reclaim an memcg below min, which triggers MEMCG_LRU_YOUNG;
+ * 5. Attempting to reclaim a memcg below min, which triggers MEMCG_LRU_YOUNG;
* 6. Finishing the aging on the eviction path, which triggers MEMCG_LRU_YOUNG;
- * 7. Offlining an memcg, which triggers MEMCG_LRU_OLD.
+ * 7. Offlining a memcg, which triggers MEMCG_LRU_OLD.
*
- * Note that memcg LRU only applies to global reclaim, and the round-robin
- * incrementing of their max_seq counters ensures the eventual fairness to all
- * eligible memcgs. For memcg reclaim, it still relies on mem_cgroup_iter().
+ * Notes:
+ * 1. Memcg LRU only applies to global reclaim, and the round-robin incrementing
+ * of their max_seq counters ensures the eventual fairness to all eligible
+ * memcgs. For memcg reclaim, it still relies on mem_cgroup_iter().
+ * 2. There are only two valid generations: old (seq) and young (seq+1).
+ * MEMCG_NR_GENS is set to three so that when reading the generation counter
+ * locklessly, a stale value (seq-1) does not wraparound to young.
*/
-#define MEMCG_NR_GENS 2
+#define MEMCG_NR_GENS 3
#define MEMCG_NR_BINS 8
struct lru_gen_memcg {
--- a/mm/vmscan.c~mm-mglru-respect-min_ttl_ms-with-memcgs
+++ a/mm/vmscan.c
@@ -4089,6 +4089,9 @@ static void lru_gen_rotate_memcg(struct
else
VM_WARN_ON_ONCE(true);
+ WRITE_ONCE(lruvec->lrugen.seg, seg);
+ WRITE_ONCE(lruvec->lrugen.gen, new);
+
hlist_nulls_del_rcu(&lruvec->lrugen.list);
if (op == MEMCG_LRU_HEAD || op == MEMCG_LRU_OLD)
@@ -4099,9 +4102,6 @@ static void lru_gen_rotate_memcg(struct
pgdat->memcg_lru.nr_memcgs[old]--;
pgdat->memcg_lru.nr_memcgs[new]++;
- lruvec->lrugen.gen = new;
- WRITE_ONCE(lruvec->lrugen.seg, seg);
-
if (!pgdat->memcg_lru.nr_memcgs[old] && old == get_memcg_gen(pgdat->memcg_lru.seq))
WRITE_ONCE(pgdat->memcg_lru.seq, pgdat->memcg_lru.seq + 1);
@@ -4124,11 +4124,11 @@ void lru_gen_online_memcg(struct mem_cgr
gen = get_memcg_gen(pgdat->memcg_lru.seq);
+ lruvec->lrugen.gen = gen;
+
hlist_nulls_add_tail_rcu(&lruvec->lrugen.list, &pgdat->memcg_lru.fifo[gen][bin]);
pgdat->memcg_lru.nr_memcgs[gen]++;
- lruvec->lrugen.gen = gen;
-
spin_unlock_irq(&pgdat->memcg_lru.lock);
}
}
@@ -4635,7 +4635,7 @@ static long get_nr_to_scan(struct lruvec
DEFINE_MAX_SEQ(lruvec);
if (mem_cgroup_below_min(sc->target_mem_cgroup, memcg))
- return 0;
+ return -1;
if (!should_run_aging(lruvec, max_seq, sc, can_swap, &nr_to_scan))
return nr_to_scan;
@@ -4710,7 +4710,7 @@ static bool try_to_shrink_lruvec(struct
cond_resched();
}
- /* whether try_to_inc_max_seq() was successful */
+ /* whether this lruvec should be rotated */
return nr_to_scan < 0;
}
@@ -4764,13 +4764,13 @@ static void shrink_many(struct pglist_da
struct lruvec *lruvec;
struct lru_gen_folio *lrugen;
struct mem_cgroup *memcg;
- const struct hlist_nulls_node *pos;
+ struct hlist_nulls_node *pos;
+ gen = get_memcg_gen(READ_ONCE(pgdat->memcg_lru.seq));
bin = first_bin = get_random_u32_below(MEMCG_NR_BINS);
restart:
op = 0;
memcg = NULL;
- gen = get_memcg_gen(READ_ONCE(pgdat->memcg_lru.seq));
rcu_read_lock();
@@ -4781,6 +4781,10 @@ restart:
}
mem_cgroup_put(memcg);
+ memcg = NULL;
+
+ if (gen != READ_ONCE(lrugen->gen))
+ continue;
lruvec = container_of(lrugen, struct lruvec, lrugen);
memcg = lruvec_memcg(lruvec);
@@ -4865,16 +4869,14 @@ static void set_initial_priority(struct
if (sc->priority != DEF_PRIORITY || sc->nr_to_reclaim < MIN_LRU_BATCH)
return;
/*
- * Determine the initial priority based on ((total / MEMCG_NR_GENS) >>
- * priority) * reclaimed_to_scanned_ratio = nr_to_reclaim, where the
- * estimated reclaimed_to_scanned_ratio = inactive / total.
+ * Determine the initial priority based on
+ * (total >> priority) * reclaimed_to_scanned_ratio = nr_to_reclaim,
+ * where reclaimed_to_scanned_ratio = inactive / total.
*/
reclaimable = node_page_state(pgdat, NR_INACTIVE_FILE);
if (get_swappiness(lruvec, sc))
reclaimable += node_page_state(pgdat, NR_INACTIVE_ANON);
- reclaimable /= MEMCG_NR_GENS;
-
/* round down reclaimable and round up sc->nr_to_reclaim */
priority = fls_long(reclaimable) - 1 - fls_long(sc->nr_to_reclaim - 1);
_
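The mmzone.h note about MEMCG_NR_GENS can be checked with a little modular arithmetic. The sketch below is a userspace model that assumes a generation slot is derived as seq modulo the number of generations; gen_slot() and stale_aliases_young() are hypothetical helper names introduced only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Map a memcg LRU sequence number to a generation slot (assumed: seq % nr_gens). */
static unsigned int gen_slot(unsigned long seq, unsigned int nr_gens)
{
	assert(nr_gens > 0);
	return (unsigned int)(seq % nr_gens);
}

/*
 * True if a lockless reader holding the stale value seq - 1 would be
 * mapped to the same slot as the young generation, seq + 1.
 */
static bool stale_aliases_young(unsigned long seq, unsigned int nr_gens)
{
	return gen_slot(seq - 1, nr_gens) == gen_slot(seq + 1, nr_gens);
}
```

With two slots, seq - 1 and seq + 1 differ by two and therefore always share a slot. With three slots they never do, which is the property the patch relies on when it compares gen against READ_ONCE(lrugen->gen).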
The patch titled
Subject: mm/mglru: try to stop at high watermarks
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-mglru-try-to-stop-at-high-watermarks.patch
------------------------------------------------------
From: Yu Zhao <yuzhao(a)google.com>
Subject: mm/mglru: try to stop at high watermarks
Date: Thu, 7 Dec 2023 23:14:05 -0700
The initial MGLRU patchset didn't include the memcg LRU support, and it
relied on should_abort_scan(), added by commit f76c83378851 ("mm:
multi-gen LRU: optimize multiple memcgs"), to "backoff to avoid
overshooting their aggregate reclaim target by too much".
Later on when the memcg LRU was added, should_abort_scan() was deemed
unnecessary, and the test results [1] showed no side effects after it was
removed by commit a579086c99ed ("mm: multi-gen LRU: remove eviction
fairness safeguard").
However, that test used memory.reclaim, which sets nr_to_reclaim to
SWAP_CLUSTER_MAX. So it can overshoot only by SWAP_CLUSTER_MAX-1 pages,
i.e., from nr_reclaimed=nr_to_reclaim-1 to
nr_reclaimed=nr_to_reclaim+SWAP_CLUSTER_MAX-1. Compared with the batch
size kswapd sets to nr_to_reclaim, SWAP_CLUSTER_MAX is tiny. Therefore
that test isn't able to reproduce the worst case scenario, i.e., kswapd
overshooting GBs on large systems and "consuming 100% CPU" (see the Closes
tag).
Bring back a simplified version of should_abort_scan() on top of the memcg
LRU, so that kswapd stops when all eligible zones are above their
respective high watermarks plus a small delta to lower the chance of
KSWAPD_HIGH_WMARK_HIT_QUICKLY. Note that this only applies to order-0
reclaim, meaning compaction-induced reclaim can still run wild (which is a
different problem).
On Android, launching 55 apps sequentially:
           Before     After      Change
  pgpgin   838377172  802955040  -4%
  pgpgout  38037080   34336300   -10%
[1] https://lore.kernel.org/20221222041905.2431096-1-yuzhao@google.com/
Link: https://lkml.kernel.org/r/20231208061407.2125867-2-yuzhao@google.com
Fixes: a579086c99ed ("mm: multi-gen LRU: remove eviction fairness safeguard")
Signed-off-by: Yu Zhao <yuzhao(a)google.com>
Reported-by: Charan Teja Kalla <quic_charante(a)quicinc.com>
Reported-by: Jaroslav Pulchart <jaroslav.pulchart(a)gooddata.com>
Closes: https://lore.kernel.org/CAK8fFZ4DY+GtBA40Pm7Nn5xCHy+51w3sfxPqkqpqakSXYyX+Wg…
Tested-by: Jaroslav Pulchart <jaroslav.pulchart(a)gooddata.com>
Tested-by: Kalesh Singh <kaleshsingh(a)google.com>
Cc: Hillf Danton <hdanton(a)sina.com>
Cc: Kairui Song <ryncsn(a)gmail.com>
Cc: T.J. Mercier <tjmercier(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/vmscan.c | 36 ++++++++++++++++++++++++++++--------
1 file changed, 28 insertions(+), 8 deletions(-)
--- a/mm/vmscan.c~mm-mglru-try-to-stop-at-high-watermarks
+++ a/mm/vmscan.c
@@ -4648,20 +4648,41 @@ static long get_nr_to_scan(struct lruvec
return try_to_inc_max_seq(lruvec, max_seq, sc, can_swap, false) ? -1 : 0;
}
-static unsigned long get_nr_to_reclaim(struct scan_control *sc)
+static bool should_abort_scan(struct lruvec *lruvec, struct scan_control *sc)
{
+ int i;
+ enum zone_watermarks mark;
+
/* don't abort memcg reclaim to ensure fairness */
if (!root_reclaim(sc))
- return -1;
+ return false;
+
+ if (sc->nr_reclaimed >= max(sc->nr_to_reclaim, compact_gap(sc->order)))
+ return true;
+
+ /* check the order to exclude compaction-induced reclaim */
+ if (!current_is_kswapd() || sc->order)
+ return false;
+
+ mark = sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING ?
+ WMARK_PROMO : WMARK_HIGH;
+
+ for (i = 0; i <= sc->reclaim_idx; i++) {
+ struct zone *zone = lruvec_pgdat(lruvec)->node_zones + i;
+ unsigned long size = wmark_pages(zone, mark) + MIN_LRU_BATCH;
+
+ if (managed_zone(zone) && !zone_watermark_ok(zone, 0, size, sc->reclaim_idx, 0))
+ return false;
+ }
- return max(sc->nr_to_reclaim, compact_gap(sc->order));
+ /* kswapd should abort if all eligible zones are safe */
+ return true;
}
static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
{
long nr_to_scan;
unsigned long scanned = 0;
- unsigned long nr_to_reclaim = get_nr_to_reclaim(sc);
int swappiness = get_swappiness(lruvec, sc);
/* clean file folios are more likely to exist */
@@ -4683,7 +4704,7 @@ static bool try_to_shrink_lruvec(struct
if (scanned >= nr_to_scan)
break;
- if (sc->nr_reclaimed >= nr_to_reclaim)
+ if (should_abort_scan(lruvec, sc))
break;
cond_resched();
@@ -4744,7 +4765,6 @@ static void shrink_many(struct pglist_da
struct lru_gen_folio *lrugen;
struct mem_cgroup *memcg;
const struct hlist_nulls_node *pos;
- unsigned long nr_to_reclaim = get_nr_to_reclaim(sc);
bin = first_bin = get_random_u32_below(MEMCG_NR_BINS);
restart:
@@ -4777,7 +4797,7 @@ restart:
rcu_read_lock();
- if (sc->nr_reclaimed >= nr_to_reclaim)
+ if (should_abort_scan(lruvec, sc))
break;
}
@@ -4788,7 +4808,7 @@ restart:
mem_cgroup_put(memcg);
- if (sc->nr_reclaimed >= nr_to_reclaim)
+ if (!is_a_nulls(pos))
return;
/* restart if raced with lru_gen_rotate_memcg() */
_
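The kswapd branch of should_abort_scan() boils down to "stop once every eligible zone is above its high watermark plus a small delta". The following is a deliberately simplified userspace model of that check. It ignores lowmem reserves, the WMARK_PROMO case and the non-kswapd paths; struct zone_model and kswapd_should_abort() are hypothetical names, and the MIN_LRU_BATCH value of 64 is an assumption chosen for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

#define MIN_LRU_BATCH 64UL /* assumed value, for illustration only */

/* Hypothetical, reduced view of a zone. */
struct zone_model {
	bool managed;             /* zone has pages managed by the allocator */
	unsigned long free;       /* free pages in the zone */
	unsigned long high_wmark; /* high watermark, in pages */
};

/*
 * Return true only if every managed zone has more free pages than its
 * high watermark plus MIN_LRU_BATCH; unmanaged zones are skipped.
 */
static bool kswapd_should_abort(const struct zone_model *zones, size_t nr)
{
	for (size_t i = 0; i < nr; i++) {
		if (zones[i].managed &&
		    zones[i].free <= zones[i].high_wmark + MIN_LRU_BATCH)
			return false;
	}

	return true;
}
```

The extra MIN_LRU_BATCH margin plays the role of the "small delta" from the commit message: reclaim continues slightly past the high watermark so that kswapd is less likely to be woken again immediately.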
The patch titled
Subject: mm/mglru: fix underprotected page cache
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-mglru-fix-underprotected-page-cache.patch
------------------------------------------------------
From: Yu Zhao <yuzhao(a)google.com>
Subject: mm/mglru: fix underprotected page cache
Date: Thu, 7 Dec 2023 23:14:04 -0700
Unmapped folios accessed through file descriptors can be underprotected.
Those folios are added to the oldest generation based on:
1. The fact that they are less costly to reclaim (no need to walk the
rmap and flush the TLB) and have less impact on performance (don't
cause major PFs and can be non-blocking if needed again).
2. The observation that they are likely to be single-use. E.g., for
client use cases like Android, its apps parse configuration files
and store the data in heap (anon); for server use cases like MySQL,
it reads from InnoDB files and holds the cached data for tables in
buffer pools (anon).
However, the oldest generation can be very short-lived, and if so, it
doesn't provide the PID controller with enough time to respond to a surge
of refaults. (Note that the PID controller uses weighted refaults and
those from evicted generations carry only half of the total weight.) In
other words, for a short-lived generation, the moving average smooths out
the spike quickly.
To fix the problem:
1. For folios that are already on LRU, if they can be beyond the
tracking range of tiers, i.e., five accesses through file
descriptors, move them to the second oldest generation to give them
more time to age. (Note that tiers are used by the PID controller
to statistically determine whether folios accessed multiple times
through file descriptors are worth protecting.)
2. When adding unmapped folios to LRU, adjust the placement of them so
that they are not too close to the tail. The effect of this is
similar to the above.
On Android, launching 55 apps sequentially:
Before After Change
workingset_refault_anon 25641024 25598972 0%
workingset_refault_file 115016834 106178438 -8%
Link: https://lkml.kernel.org/r/20231208061407.2125867-1-yuzhao@google.com
Fixes: ac35a4902374 ("mm: multi-gen LRU: minimal implementation")
Signed-off-by: Yu Zhao <yuzhao(a)google.com>
Reported-by: Charan Teja Kalla <quic_charante(a)quicinc.com>
Tested-by: Kalesh Singh <kaleshsingh(a)google.com>
Cc: T.J. Mercier <tjmercier(a)google.com>
Cc: Kairui Song <ryncsn(a)gmail.com>
Cc: Hillf Danton <hdanton(a)sina.com>
Cc: Jaroslav Pulchart <jaroslav.pulchart(a)gooddata.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/mm_inline.h | 23 ++++++++++++++---------
mm/vmscan.c | 2 +-
mm/workingset.c | 6 +++---
3 files changed, 18 insertions(+), 13 deletions(-)
--- a/include/linux/mm_inline.h~mm-mglru-fix-underprotected-page-cache
+++ a/include/linux/mm_inline.h
@@ -232,22 +232,27 @@ static inline bool lru_gen_add_folio(str
if (folio_test_unevictable(folio) || !lrugen->enabled)
return false;
/*
- * There are three common cases for this page:
- * 1. If it's hot, e.g., freshly faulted in or previously hot and
- * migrated, add it to the youngest generation.
- * 2. If it's cold but can't be evicted immediately, i.e., an anon page
- * not in swapcache or a dirty page pending writeback, add it to the
- * second oldest generation.
- * 3. Everything else (clean, cold) is added to the oldest generation.
+ * There are four common cases for this page:
+ * 1. If it's hot, i.e., freshly faulted in, add it to the youngest
+ * generation, and it's protected over the rest below.
+ * 2. If it can't be evicted immediately, i.e., a dirty page pending
+ * writeback, add it to the second youngest generation.
+ * 3. If it should be evicted first, e.g., cold and clean from
+ * folio_rotate_reclaimable(), add it to the oldest generation.
+ * 4. Everything else falls between 2 & 3 above and is added to the
+ * second oldest generation if it's considered inactive, or the
+ * oldest generation otherwise. See lru_gen_is_active().
*/
if (folio_test_active(folio))
seq = lrugen->max_seq;
else if ((type == LRU_GEN_ANON && !folio_test_swapcache(folio)) ||
(folio_test_reclaim(folio) &&
(folio_test_dirty(folio) || folio_test_writeback(folio))))
- seq = lrugen->min_seq[type] + 1;
- else
+ seq = lrugen->max_seq - 1;
+ else if (reclaiming || lrugen->min_seq[type] + MIN_NR_GENS >= lrugen->max_seq)
seq = lrugen->min_seq[type];
+ else
+ seq = lrugen->min_seq[type] + 1;
gen = lru_gen_from_seq(seq);
flags = (gen + 1UL) << LRU_GEN_PGOFF;
--- a/mm/vmscan.c~mm-mglru-fix-underprotected-page-cache
+++ a/mm/vmscan.c
@@ -4232,7 +4232,7 @@ static bool sort_folio(struct lruvec *lr
}
/* protected */
- if (tier > tier_idx) {
+ if (tier > tier_idx || refs == BIT(LRU_REFS_WIDTH)) {
int hist = lru_hist_from_seq(lrugen->min_seq[type]);
gen = folio_inc_gen(lruvec, folio, false);
--- a/mm/workingset.c~mm-mglru-fix-underprotected-page-cache
+++ a/mm/workingset.c
@@ -313,10 +313,10 @@ static void lru_gen_refault(struct folio
* 1. For pages accessed through page tables, hotter pages pushed out
* hot pages which refaulted immediately.
* 2. For pages accessed multiple times through file descriptors,
- * numbers of accesses might have been out of the range.
+ * they would have been protected by sort_folio().
*/
- if (lru_gen_in_fault() || refs == BIT(LRU_REFS_WIDTH)) {
- folio_set_workingset(folio);
+ if (lru_gen_in_fault() || refs >= BIT(LRU_REFS_WIDTH) - 1) {
+ set_mask_bits(&folio->flags, 0, LRU_REFS_MASK | BIT(PG_workingset));
mod_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + type, delta);
}
unlock:
_
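The four placement cases in the updated lru_gen_add_folio() comment can be modeled as a pure function of a few flags. The sketch below is a userspace illustration only: struct folio_model and pick_seq() are hypothetical, the flags stand in for the folio_test_*() predicates used in the diff, and MIN_NR_GENS is assumed to be 2.

```c
#include <assert.h>
#include <stdbool.h>

#define MIN_NR_GENS 2UL /* assumed to match the kernel definition */

/* Hypothetical flag bundle standing in for the folio_test_*() predicates. */
struct folio_model {
	bool active, anon, swapcache, reclaim, dirty, writeback;
};

/* Pick the generation sequence number a newly added folio lands in. */
static unsigned long pick_seq(const struct folio_model *f, bool reclaiming,
			      unsigned long min_seq, unsigned long max_seq)
{
	assert(min_seq <= max_seq && max_seq >= 1);

	if (f->active)
		return max_seq;          /* case 1: youngest */

	if ((f->anon && !f->swapcache) ||
	    (f->reclaim && (f->dirty || f->writeback)))
		return max_seq - 1;      /* case 2: second youngest */

	if (reclaiming || min_seq + MIN_NR_GENS >= max_seq)
		return min_seq;          /* case 3: oldest */

	return min_seq + 1;              /* case 4: second oldest */
}
```

For example, with min_seq = 4 and max_seq = 7, a clean unmapped file folio added outside reclaim lands in generation 5 rather than 4, which gives it one extra generation to age before eviction.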
The patch titled
Subject: mm/damon/core: make damon_start() waits until kdamond_fn() starts
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-damon-core-make-damon_start-waits-until-kdamond_fn-starts.patch
------------------------------------------------------
From: SeongJae Park <sj(a)kernel.org>
Subject: mm/damon/core: make damon_start() waits until kdamond_fn() starts
Date: Fri, 8 Dec 2023 17:50:18 +0000
The cleanup tasks of kdamond threads, including the reset of the
corresponding DAMON context's ->kdamond field and the decrement of the
global nr_running_ctxs counter, are supposed to be executed by
kdamond_fn(). However, commit 0f91d13366a4 ("mm/damon: simplify stop
mechanism") made neither damon_start() nor damon_stop() ensure that the
corresponding kdamond has started executing kdamond_fn().
As a result, the cleanup can be skipped if damon_stop() is called soon
enough after the previous damon_start(). In particular, the skipped reset
of ->kdamond could cause a use-after-free.
Fix it by waiting for start of kdamond_fn() execution from
damon_start().
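The handshake described above can be sketched in userspace. This is only an
illustration: it uses POSIX threads, and the struct completion below is a
hypothetical stand-in that borrows the kernel API's names, not the kernel
implementation. struct ctx, worker_fn() and start_and_check() are likewise
invented for the example.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the kernel's struct completion. */
struct completion {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	bool done;
};

static void init_completion(struct completion *c)
{
	pthread_mutex_init(&c->lock, NULL);
	pthread_cond_init(&c->cond, NULL);
	c->done = false;
}

static void complete(struct completion *c)
{
	pthread_mutex_lock(&c->lock);
	c->done = true;
	pthread_cond_signal(&c->cond);
	pthread_mutex_unlock(&c->lock);
}

static void wait_for_completion(struct completion *c)
{
	pthread_mutex_lock(&c->lock);
	while (!c->done)
		pthread_cond_wait(&c->cond, &c->lock);
	pthread_mutex_unlock(&c->lock);
}

/* Illustrative context: plays the role of struct damon_ctx. */
struct ctx {
	struct completion started;
	bool worker_entered;	/* set by the worker before it signals */
};

/* Plays the role of kdamond_fn(): signal first, then do the real work. */
static void *worker_fn(void *data)
{
	struct ctx *ctx = data;

	ctx->worker_entered = true;
	complete(&ctx->started);
	/* ... the main loop and the final cleanup would run here ... */
	return NULL;
}

/*
 * Plays the role of damon_start(): do not return before the worker has
 * entered worker_fn(). Returns whether the worker was seen as entered.
 */
static bool start_and_check(void)
{
	struct ctx ctx = { .worker_entered = false };
	pthread_t tid;
	bool entered;

	init_completion(&ctx.started);
	if (pthread_create(&tid, NULL, worker_fn, &ctx))
		return false;
	wait_for_completion(&ctx.started);
	entered = ctx.worker_entered;	/* ordered by the mutex above */
	pthread_join(tid, NULL);
	return entered;
}
```

Once the start path has returned, a stop request can no longer arrive before
the worker function has begun, so the worker's own cleanup is always reached.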
Link: https://lkml.kernel.org/r/20231208175018.63880-1-sj@kernel.org
Fixes: 0f91d13366a4 ("mm/damon: simplify stop mechanism")
Signed-off-by: SeongJae Park <sj(a)kernel.org>
Reported-by: Jakub Acs <acsjakub(a)amazon.de>
Cc: Changbin Du <changbin.du(a)intel.com>
Cc: Jakub Acs <acsjakub(a)amazon.de>
Cc: <stable(a)vger.kernel.org> # 5.15.x
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/damon.h | 2 ++
mm/damon/core.c | 6 ++++++
2 files changed, 8 insertions(+)
--- a/include/linux/damon.h~mm-damon-core-make-damon_start-waits-until-kdamond_fn-starts
+++ a/include/linux/damon.h
@@ -559,6 +559,8 @@ struct damon_ctx {
* update
*/
unsigned long next_ops_update_sis;
+ /* for waiting until the execution of the kdamond_fn is started */
+ struct completion kdamond_started;
/* public: */
struct task_struct *kdamond;
--- a/mm/damon/core.c~mm-damon-core-make-damon_start-waits-until-kdamond_fn-starts
+++ a/mm/damon/core.c
@@ -445,6 +445,8 @@ struct damon_ctx *damon_new_ctx(void)
if (!ctx)
return NULL;
+ init_completion(&ctx->kdamond_started);
+
ctx->attrs.sample_interval = 5 * 1000;
ctx->attrs.aggr_interval = 100 * 1000;
ctx->attrs.ops_update_interval = 60 * 1000 * 1000;
@@ -668,11 +670,14 @@ static int __damon_start(struct damon_ct
mutex_lock(&ctx->kdamond_lock);
if (!ctx->kdamond) {
err = 0;
+ reinit_completion(&ctx->kdamond_started);
ctx->kdamond = kthread_run(kdamond_fn, ctx, "kdamond.%d",
nr_running_ctxs);
if (IS_ERR(ctx->kdamond)) {
err = PTR_ERR(ctx->kdamond);
ctx->kdamond = NULL;
+ } else {
+ wait_for_completion(&ctx->kdamond_started);
}
}
mutex_unlock(&ctx->kdamond_lock);
@@ -1433,6 +1438,7 @@ static int kdamond_fn(void *data)
pr_debug("kdamond (%d) starts\n", current->pid);
+ complete(&ctx->kdamond_started);
kdamond_init_intervals_sis(ctx);
if (ctx->ops.init)
_
Patches currently in -mm which might be from sj(a)kernel.org are
mm-damon-core-make-damon_start-waits-until-kdamond_fn-starts.patch
mm-damon-core-test-test-damon_split_region_ats-access-rate-copying.patch
mm-damon-core-implement-goal-oriented-feedback-driven-quota-auto-tuning.patch
mm-damon-core-implement-goal-oriented-feedback-driven-quota-auto-tuning-fix.patch
mm-damon-sysfs-schemes-implement-files-for-scheme-quota-goals-setup.patch
mm-damon-sysfs-schemes-commit-damos-quota-goals-user-input-to-damos.patch
mm-damon-sysfs-schemes-implement-a-command-for-scheme-quota-goals-only-commit.patch
mm-damon-core-test-add-a-unit-test-for-the-feedback-loop-algorithm.patch
selftests-damon-test-quota-goals-directory.patch
docs-mm-damon-design-document-damos-quota-auto-tuning.patch
docs-abi-damon-document-damos-quota-goals.patch
docs-admin-guide-mm-damon-usage-document-for-quota-goals.patch
The cleanup tasks of kdamond threads, including the reset of the
corresponding DAMON context's ->kdamond field and the decrement of the
global nr_running_ctxs counter, are supposed to be executed by
kdamond_fn(). However, commit 0f91d13366a4 ("mm/damon: simplify stop
mechanism") made neither damon_start() nor damon_stop() ensure that the
corresponding kdamond has started executing kdamond_fn().
As a result, the cleanup can be skipped if damon_stop() is called soon
enough after the previous damon_start(). In particular, the skipped reset
of ->kdamond could cause a use-after-free.
Fix it by waiting for start of kdamond_fn() execution from
damon_start().
Fixes: 0f91d13366a4 ("mm/damon: simplify stop mechanism")
Reported-by: Jakub Acs <acsjakub(a)amazon.de>
Cc: <stable(a)vger.kernel.org> # 5.15.x
Signed-off-by: SeongJae Park <sj(a)kernel.org>
---
Note that the report has not been made public, so this patch doesn't have a
Closes: tag.
include/linux/damon.h | 2 ++
mm/damon/core.c | 6 ++++++
2 files changed, 8 insertions(+)
diff --git a/include/linux/damon.h b/include/linux/damon.h
index aa34ab433bc5..12510d8c51c6 100644
--- a/include/linux/damon.h
+++ b/include/linux/damon.h
@@ -579,6 +579,8 @@ struct damon_ctx {
* update
*/
unsigned long next_ops_update_sis;
+ /* for waiting until the execution of the kdamond_fn is started */
+ struct completion kdamond_started;
/* public: */
struct task_struct *kdamond;
diff --git a/mm/damon/core.c b/mm/damon/core.c
index f91715a58dc7..2c0cc65d041e 100644
--- a/mm/damon/core.c
+++ b/mm/damon/core.c
@@ -445,6 +445,8 @@ struct damon_ctx *damon_new_ctx(void)
if (!ctx)
return NULL;
+ init_completion(&ctx->kdamond_started);
+
ctx->attrs.sample_interval = 5 * 1000;
ctx->attrs.aggr_interval = 100 * 1000;
ctx->attrs.ops_update_interval = 60 * 1000 * 1000;
@@ -668,11 +670,14 @@ static int __damon_start(struct damon_ctx *ctx)
mutex_lock(&ctx->kdamond_lock);
if (!ctx->kdamond) {
err = 0;
+ reinit_completion(&ctx->kdamond_started);
ctx->kdamond = kthread_run(kdamond_fn, ctx, "kdamond.%d",
nr_running_ctxs);
if (IS_ERR(ctx->kdamond)) {
err = PTR_ERR(ctx->kdamond);
ctx->kdamond = NULL;
+ } else {
+ wait_for_completion(&ctx->kdamond_started);
}
}
mutex_unlock(&ctx->kdamond_lock);
@@ -1483,6 +1488,7 @@ static int kdamond_fn(void *data)
pr_debug("kdamond (%d) starts\n", current->pid);
+ complete(&ctx->kdamond_started);
kdamond_init_intervals_sis(ctx);
if (ctx->ops.init)
--
2.34.1
When a queue is unbound from the vfio_ap device driver, it is reset to
ensure its crypto data is not leaked when it is bound to another device
driver. If the queue is unbound because the adapter or domain was removed
from the host's AP configuration, then attempting to reset it will fail,
with the reset command returning response code 01 (APQN not valid). Let's
ensure that the queue is assigned to the host's
configuration before resetting it.
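The check can be modelled in userspace as below. This is a sketch under
assumptions: set_bit_inv() and test_bit_inv() here are local re-implementations
that assume the MSB-0 bit numbering used by the s390 helpers of the same name,
and struct host_config and queue_in_host_config() are hypothetical names, not
part of the driver.

```c
#include <limits.h>
#include <stdbool.h>

#define BITS_PER_LONG	(sizeof(unsigned long) * CHAR_BIT)
#define AP_DEVICES	256
#define AP_DOMAINS	256
#define BITMAP_LONGS(n)	(((n) + BITS_PER_LONG - 1) / BITS_PER_LONG)

/* Assumed MSB-0 numbering: bit 0 is the top bit of the first word. */
static void set_bit_inv(unsigned long nr, unsigned long *map)
{
	map[nr / BITS_PER_LONG] |=
		1UL << (BITS_PER_LONG - 1 - nr % BITS_PER_LONG);
}

static bool test_bit_inv(unsigned long nr, const unsigned long *map)
{
	return (map[nr / BITS_PER_LONG] &
		(1UL << (BITS_PER_LONG - 1 - nr % BITS_PER_LONG))) != 0;
}

/* Hypothetical stand-in for the host's AP configuration masks. */
struct host_config {
	unsigned long apm[BITMAP_LONGS(AP_DEVICES)];	/* adapters */
	unsigned long aqm[BITMAP_LONGS(AP_DOMAINS)];	/* domains */
};

/* A reset is only worth issuing when both halves of the APQN are present. */
static bool queue_in_host_config(const struct host_config *host,
				 unsigned int apid, unsigned int apqi)
{
	return test_bit_inv(apid, host->apm) && test_bit_inv(apqi, host->aqm);
}
```

A caller would skip the reset whenever queue_in_host_config() returns false,
which corresponds to the guard added around the reset in the remove path.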
Signed-off-by: Tony Krowiak <akrowiak(a)linux.ibm.com>
Fixes: eeb386aeb5b7 ("s390/vfio-ap: handle config changed and scan complete notification")
Cc: <stable(a)vger.kernel.org>
---
drivers/s390/crypto/vfio_ap_ops.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)
diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
index 5db11d50b4b0..b6928fc3b395 100644
--- a/drivers/s390/crypto/vfio_ap_ops.c
+++ b/drivers/s390/crypto/vfio_ap_ops.c
@@ -2214,6 +2214,8 @@ void vfio_ap_mdev_remove_queue(struct ap_device *apdev)
q = dev_get_drvdata(&apdev->device);
get_update_locks_for_queue(q);
matrix_mdev = q->matrix_mdev;
+ apid = AP_QID_CARD(q->apqn);
+ apqi = AP_QID_QUEUE(q->apqn);
if (matrix_mdev) {
/* If the queue is assigned to the guest's AP configuration */
@@ -2231,8 +2233,16 @@ void vfio_ap_mdev_remove_queue(struct ap_device *apdev)
}
}
- vfio_ap_mdev_reset_queue(q);
- flush_work(&q->reset_work);
+ /*
+ * If the queue is not in the host's AP configuration, then resetting
+ * it will fail with response code 01, (APQN not valid); so, let's make
+ * sure it is in the host's config.
+ */
+ if (test_bit_inv(apid, (unsigned long *)matrix_dev->info.apm) &&
+ test_bit_inv(apqi, (unsigned long *)matrix_dev->info.aqm)) {
+ vfio_ap_mdev_reset_queue(q);
+ flush_work(&q->reset_work);
+ }
done:
if (matrix_mdev)
--
2.43.0
When a queue is unbound from the vfio_ap device driver, if that queue is
assigned to a guest's AP configuration, its associated adapter is removed
because queues are defined to a guest via a matrix of adapters and
domains; so, it is not possible to remove a single queue.
If an adapter is removed from the guest's AP configuration, all associated
queues must be reset to prevent leaking crypto data should any of them be
assigned to a different guest or device driver. The one caveat is that if
the queue is being removed because the adapter or domain has been removed
from the host's AP configuration, then an attempt to reset the queue will
fail with response code 01, AP-queue number not valid; so resetting these
queues should be skipped.
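Why a single queue cannot be unplugged follows from the matrix model, which
can be sketched as below. The MKQID/QID_CARD/QID_QUEUE macros are assumed to
mirror the kernel's AP_MKQID, AP_QID_CARD and AP_QID_QUEUE (APID in the high
byte, APQI in the low byte); struct matrix and the helper functions are
hypothetical and use plain bool arrays instead of the driver's bitmaps.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed layout: APQN = (APID << 8) | APQI, both 8 bits wide. */
#define MKQID(card, queue)	((((card) & 0xff) << 8) | ((queue) & 0xff))
#define QID_CARD(qid)		(((qid) >> 8) & 0xff)
#define QID_QUEUE(qid)		((qid) & 0xff)

/* Hypothetical guest configuration: one flag per adapter and per domain. */
struct matrix {
	bool apm[256];	/* adapters (APIDs) */
	bool aqm[256];	/* domains (APQIs) */
};

/* A queue exists for the guest only if both its adapter and domain are set. */
static bool matrix_has_queue(const struct matrix *m, int apqn)
{
	return m->apm[QID_CARD(apqn)] && m->aqm[QID_QUEUE(apqn)];
}

/* The guest sees the cross product of adapters and domains. */
static size_t matrix_count_queues(const struct matrix *m)
{
	size_t n = 0;

	for (int apid = 0; apid < 256; apid++)
		for (int apqi = 0; apqi < 256; apqi++)
			if (matrix_has_queue(m, MKQID(apid, apqi)))
				n++;
	return n;
}

/* Removing one queue means removing its adapter, and all queues on it. */
static void unplug_adapter_of(struct matrix *m, int apqn)
{
	m->apm[QID_CARD(apqn)] = false;
}
```

With adapters 14 and 16 and domains 4, 5 and 6 assigned, unplugging the
adapter that owns queue 16.0005 also takes 16.0004 and 16.0006 away from the
guest, which is why all of those queues need a reset.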
Signed-off-by: Tony Krowiak <akrowiak(a)linux.ibm.com>
Fixes: 09d31ff78793 ("s390/vfio-ap: hot plug/unplug of AP devices when probed/removed")
Cc: <stable(a)vger.kernel.org>
---
drivers/s390/crypto/vfio_ap_ops.c | 39 ++++++++++++++++++++++++-------
1 file changed, 30 insertions(+), 9 deletions(-)
diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
index f08321385058..5db11d50b4b0 100644
--- a/drivers/s390/crypto/vfio_ap_ops.c
+++ b/drivers/s390/crypto/vfio_ap_ops.c
@@ -2187,6 +2187,23 @@ int vfio_ap_mdev_probe_queue(struct ap_device *apdev)
return ret;
}
+static void reset_queues_for_apid(struct ap_matrix_mdev *matrix_mdev,
+ unsigned long apid)
+{
+ DECLARE_BITMAP(apm_reset, AP_DEVICES);
+
+ /*
+ * If the adapter is not in the host's AP configuration, then resetting
+ * any queue for that adapter will fail with response code 01, (APQN not
+ * valid).
+ */
+ if (test_bit_inv(apid, (unsigned long *)matrix_dev->info.apm)) {
+ bitmap_clear(apm_reset, 0, AP_DEVICES);
+ set_bit_inv(apid, apm_reset);
+ reset_queues_for_apids(matrix_mdev, apm_reset);
+ }
+}
+
void vfio_ap_mdev_remove_queue(struct ap_device *apdev)
{
unsigned long apid, apqi;
@@ -2199,24 +2216,28 @@ void vfio_ap_mdev_remove_queue(struct ap_device *apdev)
matrix_mdev = q->matrix_mdev;
if (matrix_mdev) {
- vfio_ap_unlink_queue_fr_mdev(q);
-
- apid = AP_QID_CARD(q->apqn);
- apqi = AP_QID_QUEUE(q->apqn);
-
- /*
- * If the queue is assigned to the guest's APCB, then remove
- * the adapter's APID from the APCB and hot it into the guest.
- */
+ /* If the queue is assigned to the guest's AP configuration */
if (test_bit_inv(apid, matrix_mdev->shadow_apcb.apm) &&
test_bit_inv(apqi, matrix_mdev->shadow_apcb.aqm)) {
+ /*
+ * Since the queues are defined via a matrix of adapters
+ * and domains, it is not possible to hot unplug a
+ * single queue; so, let's unplug the adapter.
+ */
clear_bit_inv(apid, matrix_mdev->shadow_apcb.apm);
vfio_ap_mdev_update_guest_apcb(matrix_mdev);
+ reset_queues_for_apid(matrix_mdev, apid);
+ goto done;
}
}
vfio_ap_mdev_reset_queue(q);
flush_work(&q->reset_work);
+
+done:
+ if (matrix_mdev)
+ vfio_ap_unlink_queue_fr_mdev(q);
+
dev_set_drvdata(&apdev->device, NULL);
kfree(q);
release_update_locks_for_mdev(matrix_mdev);
--
2.43.0
When filtering the adapters from the configuration profile for a guest to
create or update a guest's AP configuration, if the APID of an adapter and
the APQI of a domain identify a queue device that is not bound to the
vfio_ap device driver, the APID of the adapter will be filtered because an
individual APQN cannot be filtered: APQNs are assigned to an AP
configuration as a matrix of APIDs and APQIs. Consequently, a
guest will not have access to all of the queues associated with the
filtered adapter. If the queues are subsequently made available again to
the guest, they should re-appear in a reset state; so, let's make sure all
queues associated with an adapter unplugged from the guest are reset.
In order to identify the set of queues that need to be reset, let's allow a
vfio_ap_queue object to be simultaneously stored in both a hashtable and a
list: a hashtable that stores all of the queues assigned to a matrix mdev,
and a list that stores the subset of those queues that need to be reset.
For example, when an adapter is hot unplugged from a
guest, all guest queues associated with that adapter must be reset. Since
that may be a subset of those assigned to the matrix mdev, they can be
stored in a list that can be passed to the vfio_ap_mdev_reset_queues
function.
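The idea of one object sitting in two containers at once can be sketched with
two independent link fields. This is a simplified illustration: the patch uses
struct hlist_node and struct list_head, while the sketch below uses plain
pointers, and struct queue, reset_list_add() and reset_list_run() are
hypothetical names.

```c
#include <stddef.h>

/* Hypothetical queue object with one link field per container. */
struct queue {
	int apqn;
	struct queue *mdev_next;	/* stand-in for the hashtable node */
	struct queue *reset_next;	/* stand-in for the reset-list node */
	int reset_count;		/* how often this queue was reset */
};

/* Push @q onto the reset list; its mdev_next link is left untouched. */
static void reset_list_add(struct queue **head, struct queue *q)
{
	q->reset_next = *head;
	*head = q;
}

/* "Reset" every queue on the list and return how many were visited. */
static int reset_list_run(struct queue *head)
{
	int n = 0;

	for (struct queue *q = head; q != NULL; q = q->reset_next) {
		q->reset_count++;
		n++;
	}
	return n;
}
```

Because the reset list has its own link field, building and tearing down the
list of queues to reset never disturbs the container that tracks every queue
assigned to the mdev.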
Signed-off-by: Tony Krowiak <akrowiak(a)linux.ibm.com>
Fixes: 48cae940c31d ("s390/vfio-ap: refresh guest's APCB by filtering AP resources assigned to mdev")
Cc: <stable(a)vger.kernel.org>
---
drivers/s390/crypto/vfio_ap_ops.c | 157 +++++++++++++++++++-------
drivers/s390/crypto/vfio_ap_private.h | 11 +-
2 files changed, 126 insertions(+), 42 deletions(-)
diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
index 26bd4aca497a..f08321385058 100644
--- a/drivers/s390/crypto/vfio_ap_ops.c
+++ b/drivers/s390/crypto/vfio_ap_ops.c
@@ -33,6 +33,7 @@
#define AP_RESET_INTERVAL 20 /* Reset sleep interval (20ms) */
static int vfio_ap_mdev_reset_queues(struct ap_queue_table *qtable);
+static int vfio_ap_mdev_reset_qlist(struct list_head *qlist);
static struct vfio_ap_queue *vfio_ap_find_queue(int apqn);
static const struct vfio_device_ops vfio_ap_matrix_dev_ops;
static void vfio_ap_mdev_reset_queue(struct vfio_ap_queue *q);
@@ -661,16 +662,23 @@ static bool vfio_ap_mdev_filter_cdoms(struct ap_matrix_mdev *matrix_mdev)
* device driver.
*
* @matrix_mdev: the matrix mdev whose matrix is to be filtered.
+ * @apm_filtered: a 256-bit bitmap for storing the APIDs filtered from the
+ * guest's AP configuration that are still in the host's AP
+ * configuration.
*
* Note: If an APQN referencing a queue device that is not bound to the vfio_ap
* driver, its APID will be filtered from the guest's APCB. The matrix
* structure precludes filtering an individual APQN, so its APID will be
- * filtered.
+ * filtered. Consequently, all queues associated with the adapter that
+ * are in the host's AP configuration must be reset. If queues are
+ * subsequently made available again to the guest, they should re-appear
+ * in a reset state
*
* Return: a boolean value indicating whether the KVM guest's APCB was changed
* by the filtering or not.
*/
-static bool vfio_ap_mdev_filter_matrix(struct ap_matrix_mdev *matrix_mdev)
+static bool vfio_ap_mdev_filter_matrix(struct ap_matrix_mdev *matrix_mdev,
+ unsigned long *apm_filtered)
{
unsigned long apid, apqi, apqn;
DECLARE_BITMAP(prev_shadow_apm, AP_DEVICES);
@@ -680,6 +688,7 @@ static bool vfio_ap_mdev_filter_matrix(struct ap_matrix_mdev *matrix_mdev)
bitmap_copy(prev_shadow_apm, matrix_mdev->shadow_apcb.apm, AP_DEVICES);
bitmap_copy(prev_shadow_aqm, matrix_mdev->shadow_apcb.aqm, AP_DOMAINS);
vfio_ap_matrix_init(&matrix_dev->info, &matrix_mdev->shadow_apcb);
+ bitmap_clear(apm_filtered, 0, AP_DEVICES);
/*
* Copy the adapters, domains and control domains to the shadow_apcb
@@ -705,8 +714,16 @@ static bool vfio_ap_mdev_filter_matrix(struct ap_matrix_mdev *matrix_mdev)
apqn = AP_MKQID(apid, apqi);
q = vfio_ap_mdev_get_queue(matrix_mdev, apqn);
if (!q || q->reset_status.response_code) {
- clear_bit_inv(apid,
- matrix_mdev->shadow_apcb.apm);
+ clear_bit_inv(apid, matrix_mdev->shadow_apcb.apm);
+
+ /*
+ * If the adapter was previously plugged into
+ * the guest, let's let the caller know that
+ * the APID was filtered.
+ */
+ if (test_bit_inv(apid, prev_shadow_apm))
+ set_bit_inv(apid, apm_filtered);
+
break;
}
}
@@ -918,6 +935,47 @@ static void vfio_ap_mdev_link_adapter(struct ap_matrix_mdev *matrix_mdev,
AP_MKQID(apid, apqi));
}
+static int reset_queues_for_apids(struct ap_matrix_mdev *matrix_mdev,
+ unsigned long *apm_reset)
+{
+ struct vfio_ap_queue *q, *tmpq;
+ struct list_head qlist;
+ unsigned long apid, apqi;
+ int apqn, ret = 0;
+
+ if (bitmap_empty(apm_reset, AP_DEVICES))
+ return 0;
+
+ INIT_LIST_HEAD(&qlist);
+
+ for_each_set_bit_inv(apid, apm_reset, AP_DEVICES) {
+ for_each_set_bit_inv(apqi, matrix_mdev->shadow_apcb.aqm,
+ AP_DOMAINS) {
+ /*
+ * If the domain is not in the host's AP configuration,
+ * then resetting it will fail with response code 01
+ * (APQN not valid).
+ */
+ if (!test_bit_inv(apqi,
+ (unsigned long *)matrix_dev->info.aqm))
+ continue;
+
+ apqn = AP_MKQID(apid, apqi);
+ q = vfio_ap_mdev_get_queue(matrix_mdev, apqn);
+
+ if (q)
+ list_add_tail(&q->reset_qnode, &qlist);
+ }
+ }
+
+ ret = vfio_ap_mdev_reset_qlist(&qlist);
+
+ list_for_each_entry_safe(q, tmpq, &qlist, reset_qnode)
+ list_del(&q->reset_qnode);
+
+ return ret;
+}
+
/**
* assign_adapter_store - parses the APID from @buf and sets the
* corresponding bit in the mediated matrix device's APM
@@ -958,6 +1016,7 @@ static ssize_t assign_adapter_store(struct device *dev,
{
int ret;
unsigned long apid;
+ DECLARE_BITMAP(apm_filtered, AP_DEVICES);
struct ap_matrix_mdev *matrix_mdev = dev_get_drvdata(dev);
mutex_lock(&ap_perms_mutex);
@@ -987,8 +1046,10 @@ static ssize_t assign_adapter_store(struct device *dev,
vfio_ap_mdev_link_adapter(matrix_mdev, apid);
- if (vfio_ap_mdev_filter_matrix(matrix_mdev))
+ if (vfio_ap_mdev_filter_matrix(matrix_mdev, apm_filtered)) {
vfio_ap_mdev_update_guest_apcb(matrix_mdev);
+ reset_queues_for_apids(matrix_mdev, apm_filtered);
+ }
ret = count;
done:
@@ -1019,11 +1080,12 @@ static struct vfio_ap_queue
* adapter was assigned.
* @matrix_mdev: the matrix mediated device to which the adapter was assigned.
* @apid: the APID of the unassigned adapter.
- * @qtable: table for storing queues associated with unassigned adapter.
+ * @qlist: list for storing queues associated with unassigned adapter that
+ * need to be reset.
*/
static void vfio_ap_mdev_unlink_adapter(struct ap_matrix_mdev *matrix_mdev,
unsigned long apid,
- struct ap_queue_table *qtable)
+ struct list_head *qlist)
{
unsigned long apqi;
struct vfio_ap_queue *q;
@@ -1031,11 +1093,10 @@ static void vfio_ap_mdev_unlink_adapter(struct ap_matrix_mdev *matrix_mdev,
for_each_set_bit_inv(apqi, matrix_mdev->matrix.aqm, AP_DOMAINS) {
q = vfio_ap_unlink_apqn_fr_mdev(matrix_mdev, apid, apqi);
- if (q && qtable) {
+ if (q && qlist) {
if (test_bit_inv(apid, matrix_mdev->shadow_apcb.apm) &&
test_bit_inv(apqi, matrix_mdev->shadow_apcb.aqm))
- hash_add(qtable->queues, &q->mdev_qnode,
- q->apqn);
+ list_add_tail(&q->reset_qnode, qlist);
}
}
}
@@ -1043,26 +1104,23 @@ static void vfio_ap_mdev_unlink_adapter(struct ap_matrix_mdev *matrix_mdev,
static void vfio_ap_mdev_hot_unplug_adapter(struct ap_matrix_mdev *matrix_mdev,
unsigned long apid)
{
- int loop_cursor;
- struct vfio_ap_queue *q;
- struct ap_queue_table *qtable = kzalloc(sizeof(*qtable), GFP_KERNEL);
+ struct vfio_ap_queue *q, *tmpq;
+ struct list_head qlist;
- hash_init(qtable->queues);
- vfio_ap_mdev_unlink_adapter(matrix_mdev, apid, qtable);
+ INIT_LIST_HEAD(&qlist);
+ vfio_ap_mdev_unlink_adapter(matrix_mdev, apid, &qlist);
if (test_bit_inv(apid, matrix_mdev->shadow_apcb.apm)) {
clear_bit_inv(apid, matrix_mdev->shadow_apcb.apm);
vfio_ap_mdev_update_guest_apcb(matrix_mdev);
}
- vfio_ap_mdev_reset_queues(qtable);
+ vfio_ap_mdev_reset_qlist(&qlist);
- hash_for_each(qtable->queues, loop_cursor, q, mdev_qnode) {
+ list_for_each_entry_safe(q, tmpq, &qlist, reset_qnode) {
vfio_ap_unlink_mdev_fr_queue(q);
- hash_del(&q->mdev_qnode);
+ list_del(&q->reset_qnode);
}
-
- kfree(qtable);
}
/**
@@ -1163,6 +1221,7 @@ static ssize_t assign_domain_store(struct device *dev,
{
int ret;
unsigned long apqi;
+ DECLARE_BITMAP(apm_filtered, AP_DEVICES);
struct ap_matrix_mdev *matrix_mdev = dev_get_drvdata(dev);
mutex_lock(&ap_perms_mutex);
@@ -1192,8 +1251,10 @@ static ssize_t assign_domain_store(struct device *dev,
vfio_ap_mdev_link_domain(matrix_mdev, apqi);
- if (vfio_ap_mdev_filter_matrix(matrix_mdev))
+ if (vfio_ap_mdev_filter_matrix(matrix_mdev, apm_filtered)) {
vfio_ap_mdev_update_guest_apcb(matrix_mdev);
+ reset_queues_for_apids(matrix_mdev, apm_filtered);
+ }
ret = count;
done:
@@ -1206,7 +1267,7 @@ static DEVICE_ATTR_WO(assign_domain);
static void vfio_ap_mdev_unlink_domain(struct ap_matrix_mdev *matrix_mdev,
unsigned long apqi,
- struct ap_queue_table *qtable)
+ struct list_head *qlist)
{
unsigned long apid;
struct vfio_ap_queue *q;
@@ -1214,11 +1275,10 @@ static void vfio_ap_mdev_unlink_domain(struct ap_matrix_mdev *matrix_mdev,
for_each_set_bit_inv(apid, matrix_mdev->matrix.apm, AP_DEVICES) {
q = vfio_ap_unlink_apqn_fr_mdev(matrix_mdev, apid, apqi);
- if (q && qtable) {
+ if (q && qlist) {
if (test_bit_inv(apid, matrix_mdev->shadow_apcb.apm) &&
test_bit_inv(apqi, matrix_mdev->shadow_apcb.aqm))
- hash_add(qtable->queues, &q->mdev_qnode,
- q->apqn);
+ list_add_tail(&q->reset_qnode, qlist);
}
}
}
@@ -1226,26 +1286,23 @@ static void vfio_ap_mdev_unlink_domain(struct ap_matrix_mdev *matrix_mdev,
static void vfio_ap_mdev_hot_unplug_domain(struct ap_matrix_mdev *matrix_mdev,
unsigned long apqi)
{
- int loop_cursor;
- struct vfio_ap_queue *q;
- struct ap_queue_table *qtable = kzalloc(sizeof(*qtable), GFP_KERNEL);
+ struct vfio_ap_queue *q, *tmpq;
+ struct list_head qlist;
- hash_init(qtable->queues);
- vfio_ap_mdev_unlink_domain(matrix_mdev, apqi, qtable);
+ INIT_LIST_HEAD(&qlist);
+ vfio_ap_mdev_unlink_domain(matrix_mdev, apqi, &qlist);
if (test_bit_inv(apqi, matrix_mdev->shadow_apcb.aqm)) {
clear_bit_inv(apqi, matrix_mdev->shadow_apcb.aqm);
vfio_ap_mdev_update_guest_apcb(matrix_mdev);
}
- vfio_ap_mdev_reset_queues(qtable);
+ vfio_ap_mdev_reset_qlist(&qlist);
- hash_for_each(qtable->queues, loop_cursor, q, mdev_qnode) {
+ list_for_each_entry_safe(q, tmpq, &qlist, reset_qnode) {
vfio_ap_unlink_mdev_fr_queue(q);
- hash_del(&q->mdev_qnode);
+ list_del(&q->reset_qnode);
}
-
- kfree(qtable);
}
/**
@@ -1754,6 +1811,24 @@ static int vfio_ap_mdev_reset_queues(struct ap_queue_table *qtable)
return ret;
}
+static int vfio_ap_mdev_reset_qlist(struct list_head *qlist)
+{
+ int ret = 0;
+ struct vfio_ap_queue *q;
+
+ list_for_each_entry(q, qlist, reset_qnode)
+ vfio_ap_mdev_reset_queue(q);
+
+ list_for_each_entry(q, qlist, reset_qnode) {
+ flush_work(&q->reset_work);
+
+ if (q->reset_status.response_code)
+ ret = -EIO;
+ }
+
+ return ret;
+}
+
static int vfio_ap_mdev_open_device(struct vfio_device *vdev)
{
struct ap_matrix_mdev *matrix_mdev =
@@ -2062,6 +2137,7 @@ int vfio_ap_mdev_probe_queue(struct ap_device *apdev)
{
int ret;
struct vfio_ap_queue *q;
+ DECLARE_BITMAP(apm_filtered, AP_DEVICES);
struct ap_matrix_mdev *matrix_mdev;
ret = sysfs_create_group(&apdev->device.kobj, &vfio_queue_attr_group);
@@ -2094,15 +2170,17 @@ int vfio_ap_mdev_probe_queue(struct ap_device *apdev)
!bitmap_empty(matrix_mdev->aqm_add, AP_DOMAINS))
goto done;
- if (vfio_ap_mdev_filter_matrix(matrix_mdev))
+ if (vfio_ap_mdev_filter_matrix(matrix_mdev, apm_filtered)) {
vfio_ap_mdev_update_guest_apcb(matrix_mdev);
+ reset_queues_for_apids(matrix_mdev, apm_filtered);
+ }
}
done:
dev_set_drvdata(&apdev->device, q);
release_update_locks_for_mdev(matrix_mdev);
- return 0;
+ return ret;
err_remove_group:
sysfs_remove_group(&apdev->device.kobj, &vfio_queue_attr_group);
@@ -2446,6 +2524,7 @@ void vfio_ap_on_cfg_changed(struct ap_config_info *cur_cfg_info,
static void vfio_ap_mdev_hot_plug_cfg(struct ap_matrix_mdev *matrix_mdev)
{
+ DECLARE_BITMAP(apm_filtered, AP_DEVICES);
bool filter_domains, filter_adapters, filter_cdoms, do_hotplug = false;
mutex_lock(&matrix_mdev->kvm->lock);
@@ -2459,7 +2538,7 @@ static void vfio_ap_mdev_hot_plug_cfg(struct ap_matrix_mdev *matrix_mdev)
matrix_mdev->adm_add, AP_DOMAINS);
if (filter_adapters || filter_domains)
- do_hotplug = vfio_ap_mdev_filter_matrix(matrix_mdev);
+ do_hotplug = vfio_ap_mdev_filter_matrix(matrix_mdev, apm_filtered);
if (filter_cdoms)
do_hotplug |= vfio_ap_mdev_filter_cdoms(matrix_mdev);
@@ -2467,6 +2546,8 @@ static void vfio_ap_mdev_hot_plug_cfg(struct ap_matrix_mdev *matrix_mdev)
if (do_hotplug)
vfio_ap_mdev_update_guest_apcb(matrix_mdev);
+ reset_queues_for_apids(matrix_mdev, apm_filtered);
+
mutex_unlock(&matrix_dev->mdevs_lock);
mutex_unlock(&matrix_mdev->kvm->lock);
}
diff --git a/drivers/s390/crypto/vfio_ap_private.h b/drivers/s390/crypto/vfio_ap_private.h
index 88aff8b81f2f..20eac8b0f0b9 100644
--- a/drivers/s390/crypto/vfio_ap_private.h
+++ b/drivers/s390/crypto/vfio_ap_private.h
@@ -83,10 +83,10 @@ struct ap_matrix {
};
/**
- * struct ap_queue_table - a table of queue objects.
- *
- * @queues: a hashtable of queues (struct vfio_ap_queue).
- */
+ * struct ap_queue_table - a table of queue objects.
+ *
+ * @queues: a hashtable of queues (struct vfio_ap_queue).
+ */
struct ap_queue_table {
DECLARE_HASHTABLE(queues, 8);
};
@@ -133,6 +133,8 @@ struct ap_matrix_mdev {
* @apqn: the APQN of the AP queue device
* @saved_isc: the guest ISC registered with the GIB interface
* @mdev_qnode: allows the vfio_ap_queue struct to be added to a hashtable
+ * @reset_qnode: allows the vfio_ap_queue struct to be added to a list of queues
+ * that need to be reset
* @reset_status: the status from the last reset of the queue
* @reset_work: work to wait for queue reset to complete
*/
@@ -143,6 +145,7 @@ struct vfio_ap_queue {
#define VFIO_AP_ISC_INVALID 0xff
unsigned char saved_isc;
struct hlist_node mdev_qnode;
+ struct list_head reset_qnode;
struct ap_queue_status reset_status;
struct work_struct reset_work;
};
--
2.43.0
While filtering the mdev matrix, it doesn't make sense - and will have
unexpected results - to filter an APID from the matrix if the APID or one
of the associated APQIs is not in the host's AP configuration. There are
two reasons for this:
1. An adapter or domain that is not in the host's AP configuration can be
assigned to the matrix; this is known as over-provisioning. Queue
devices, however, are only created for adapters and domains in the
host's AP configuration, so there will be no queues associated with an
over-provisioned adapter or domain to filter.
2. The adapter or domain may have been externally removed from the host's
configuration via an SE or HMC attached to a DPM enabled LPAR. In this
case, the vfio_ap device driver would have been notified by the AP bus
via the on_config_changed callback and the adapter or domain would
have already been filtered.
Since the matrix_mdev->shadow_apcb.apm and matrix_mdev->shadow_apcb.aqm are
copied from the mdev matrix sans the APIDs and APQIs not in the host's AP
configuration, let's loop over those bitmaps instead of those assigned to
the matrix.
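The effect of the loop bounds can be shown with a small model. It is a sketch
under assumptions: struct model and filter() are hypothetical, bool arrays
replace the driver's bitmaps, and the loop_over_shadow flag exists only to
compare the old and new behaviour side by side.

```c
#include <stdbool.h>

#define N 256

/* Hypothetical model of the state the filter works on. */
struct model {
	bool matrix_apm[N], matrix_aqm[N];	/* assigned to the mdev */
	bool host_apm[N], host_aqm[N];		/* host's AP configuration */
	bool bound[N][N];			/* queue bound to the driver */
	bool shadow_apm[N], shadow_aqm[N];	/* what the guest will get */
};

/*
 * Rebuild the shadow masks as (mdev matrix AND host config), then drop any
 * adapter for which an inspected APQN has no bound queue.
 */
static void filter(struct model *m, bool loop_over_shadow)
{
	for (int i = 0; i < N; i++) {
		m->shadow_apm[i] = m->matrix_apm[i] && m->host_apm[i];
		m->shadow_aqm[i] = m->matrix_aqm[i] && m->host_aqm[i];
	}

	const bool *apm = loop_over_shadow ? m->shadow_apm : m->matrix_apm;
	const bool *aqm = loop_over_shadow ? m->shadow_aqm : m->matrix_aqm;

	for (int apid = 0; apid < N; apid++) {
		if (!apm[apid])
			continue;
		for (int apqi = 0; apqi < N; apqi++) {
			if (aqm[apqi] && !m->bound[apid][apqi]) {
				m->shadow_apm[apid] = false;
				break;
			}
		}
	}
}
```

With adapter 14 and domains 4 and 5 assigned, but only domain 4 present in
the host, looping over the mdev matrix inspects 14.0005, finds no queue
device, and filters adapter 14 even though 14.0004 is usable. Looping over the
shadow masks never visits the over-provisioned domain.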
Signed-off-by: Tony Krowiak <akrowiak(a)linux.ibm.com>
Fixes: 48cae940c31d ("s390/vfio-ap: refresh guest's APCB by filtering AP resources assigned to mdev")
Cc: <stable(a)vger.kernel.org>
---
drivers/s390/crypto/vfio_ap_ops.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
index 9382b32e5bd1..47232e19a50e 100644
--- a/drivers/s390/crypto/vfio_ap_ops.c
+++ b/drivers/s390/crypto/vfio_ap_ops.c
@@ -691,8 +691,9 @@ static bool vfio_ap_mdev_filter_matrix(struct ap_matrix_mdev *matrix_mdev)
bitmap_and(matrix_mdev->shadow_apcb.aqm, matrix_mdev->matrix.aqm,
(unsigned long *)matrix_dev->info.aqm, AP_DOMAINS);
- for_each_set_bit_inv(apid, matrix_mdev->matrix.apm, AP_DEVICES) {
- for_each_set_bit_inv(apqi, matrix_mdev->matrix.aqm, AP_DOMAINS) {
+ for_each_set_bit_inv(apid, matrix_mdev->shadow_apcb.apm, AP_DEVICES) {
+ for_each_set_bit_inv(apqi, matrix_mdev->shadow_apcb.aqm,
+ AP_DOMAINS) {
/*
* If the APQN is not bound to the vfio_ap device
* driver, then we can't assign it to the guest's
--
2.43.0
The vfio_ap_mdev_filter_matrix function is called whenever a new adapter or
domain is assigned to the mdev. The purpose of the function is to update
the guest's AP configuration by filtering the matrix of adapters and
domains assigned to the mdev. When an adapter or domain is assigned, only
the APQNs associated with the APID of the new adapter or APQI of the new
domain are inspected. If an APQN does not reference a queue device bound to
the vfio_ap device driver, then its APID will be filtered from the mdev's
matrix when updating the guest's AP configuration.
Inspecting only the APID of the new adapter or APQI of the new domain will
result in passing AP queues through to a guest that are not bound to the
vfio_ap device driver under certain circumstances. Consider the following:
guest's AP configuration (all also assigned to the mdev's matrix):
    14.0004
    14.0005
    14.0006
    16.0004
    16.0005
    16.0006
    unassign domain 4
    unbind queue 16.0005
    assign domain 4
When domain 4 is re-assigned, since only domain 4 will be inspected, the
APQNs that will be examined will be:
    14.0004
    16.0004
Since both of those APQNs reference queue devices that are bound to the
vfio_ap device driver, nothing will get filtered from the mdev's matrix
when updating the guest's AP configuration. Consequently, queue 16.0005
will get passed through despite not being bound to the driver. This
violates the Linux device model requirement that a guest shall only be
given access to devices bound to the device driver facilitating their
pass-through.
To resolve this problem, every adapter and domain assigned to the mdev will
be inspected when filtering the mdev's matrix.
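The scenario above can be replayed with a small model. This is a sketch under
assumptions: struct mdev, filter_matrix() and guest_sees_unbound_queue() are
hypothetical, bool arrays replace the driver's bitmaps, and the host's AP
configuration is assumed to contain every adapter and domain involved.

```c
#include <stdbool.h>

#define N 256

/* Hypothetical model of one mediated device. */
struct mdev {
	bool apm[N], aqm[N];	/* adapters and domains in the mdev's matrix */
	bool bound[N][N];	/* queue device bound to vfio_ap */
	bool shadow_apm[N];	/* adapters passed through to the guest */
};

/* Filter adapters, inspecting only the domains flagged in @inspect_aqm. */
static void filter_matrix(struct mdev *m, const bool *inspect_aqm)
{
	for (int apid = 0; apid < N; apid++) {
		m->shadow_apm[apid] = m->apm[apid];
		for (int apqi = 0; m->apm[apid] && apqi < N; apqi++) {
			if (inspect_aqm[apqi] && !m->bound[apid][apqi]) {
				m->shadow_apm[apid] = false;
				break;
			}
		}
	}
}

/* True if the guest would get a queue that is not bound to the driver. */
static bool guest_sees_unbound_queue(const struct mdev *m)
{
	for (int apid = 0; apid < N; apid++)
		for (int apqi = 0; apqi < N; apqi++)
			if (m->shadow_apm[apid] && m->aqm[apqi] &&
			    !m->bound[apid][apqi])
				return true;
	return false;
}
```

Passing a mask that contains only domain 4 reproduces the old behaviour:
adapter 16 stays in the guest's configuration although 16.0005 is unbound.
Passing the mdev's full domain mask, as the fix does, filters adapter 16.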
Signed-off-by: Tony Krowiak <akrowiak(a)linux.ibm.com>
Fixes: 48cae940c31d ("s390/vfio-ap: refresh guest's APCB by filtering AP resources assigned to mdev")
Cc: <stable(a)vger.kernel.org>
---
drivers/s390/crypto/vfio_ap_ops.c | 57 +++++++++----------------------
1 file changed, 17 insertions(+), 40 deletions(-)
diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
index 4db538a55192..9382b32e5bd1 100644
--- a/drivers/s390/crypto/vfio_ap_ops.c
+++ b/drivers/s390/crypto/vfio_ap_ops.c
@@ -670,8 +670,7 @@ static bool vfio_ap_mdev_filter_cdoms(struct ap_matrix_mdev *matrix_mdev)
* Return: a boolean value indicating whether the KVM guest's APCB was changed
* by the filtering or not.
*/
-static bool vfio_ap_mdev_filter_matrix(unsigned long *apm, unsigned long *aqm,
- struct ap_matrix_mdev *matrix_mdev)
+static bool vfio_ap_mdev_filter_matrix(struct ap_matrix_mdev *matrix_mdev)
{
unsigned long apid, apqi, apqn;
DECLARE_BITMAP(prev_shadow_apm, AP_DEVICES);
@@ -692,8 +691,8 @@ static bool vfio_ap_mdev_filter_matrix(unsigned long *apm, unsigned long *aqm,
bitmap_and(matrix_mdev->shadow_apcb.aqm, matrix_mdev->matrix.aqm,
(unsigned long *)matrix_dev->info.aqm, AP_DOMAINS);
- for_each_set_bit_inv(apid, apm, AP_DEVICES) {
- for_each_set_bit_inv(apqi, aqm, AP_DOMAINS) {
+ for_each_set_bit_inv(apid, matrix_mdev->matrix.apm, AP_DEVICES) {
+ for_each_set_bit_inv(apqi, matrix_mdev->matrix.aqm, AP_DOMAINS) {
/*
* If the APQN is not bound to the vfio_ap device
* driver, then we can't assign it to the guest's
@@ -958,7 +957,6 @@ static ssize_t assign_adapter_store(struct device *dev,
{
int ret;
unsigned long apid;
- DECLARE_BITMAP(apm_delta, AP_DEVICES);
struct ap_matrix_mdev *matrix_mdev = dev_get_drvdata(dev);
mutex_lock(&ap_perms_mutex);
@@ -987,11 +985,8 @@ static ssize_t assign_adapter_store(struct device *dev,
}
vfio_ap_mdev_link_adapter(matrix_mdev, apid);
- memset(apm_delta, 0, sizeof(apm_delta));
- set_bit_inv(apid, apm_delta);
- if (vfio_ap_mdev_filter_matrix(apm_delta,
- matrix_mdev->matrix.aqm, matrix_mdev))
+ if (vfio_ap_mdev_filter_matrix(matrix_mdev))
vfio_ap_mdev_update_guest_apcb(matrix_mdev);
ret = count;
@@ -1167,7 +1162,6 @@ static ssize_t assign_domain_store(struct device *dev,
{
int ret;
unsigned long apqi;
- DECLARE_BITMAP(aqm_delta, AP_DOMAINS);
struct ap_matrix_mdev *matrix_mdev = dev_get_drvdata(dev);
mutex_lock(&ap_perms_mutex);
@@ -1196,11 +1190,8 @@ static ssize_t assign_domain_store(struct device *dev,
}
vfio_ap_mdev_link_domain(matrix_mdev, apqi);
- memset(aqm_delta, 0, sizeof(aqm_delta));
- set_bit_inv(apqi, aqm_delta);
- if (vfio_ap_mdev_filter_matrix(matrix_mdev->matrix.apm, aqm_delta,
- matrix_mdev))
+ if (vfio_ap_mdev_filter_matrix(matrix_mdev))
vfio_ap_mdev_update_guest_apcb(matrix_mdev);
ret = count;
@@ -2091,9 +2082,7 @@ int vfio_ap_mdev_probe_queue(struct ap_device *apdev)
if (matrix_mdev) {
vfio_ap_mdev_link_queue(matrix_mdev, q);
- if (vfio_ap_mdev_filter_matrix(matrix_mdev->matrix.apm,
- matrix_mdev->matrix.aqm,
- matrix_mdev))
+ if (vfio_ap_mdev_filter_matrix(matrix_mdev))
vfio_ap_mdev_update_guest_apcb(matrix_mdev);
}
dev_set_drvdata(&apdev->device, q);
@@ -2443,34 +2432,22 @@ void vfio_ap_on_cfg_changed(struct ap_config_info *cur_cfg_info,
static void vfio_ap_mdev_hot_plug_cfg(struct ap_matrix_mdev *matrix_mdev)
{
- bool do_hotplug = false;
- int filter_domains = 0;
- int filter_adapters = 0;
- DECLARE_BITMAP(apm, AP_DEVICES);
- DECLARE_BITMAP(aqm, AP_DOMAINS);
+ bool filter_domains, filter_adapters, filter_cdoms, do_hotplug = false;
mutex_lock(&matrix_mdev->kvm->lock);
mutex_lock(&matrix_dev->mdevs_lock);
- filter_adapters = bitmap_and(apm, matrix_mdev->matrix.apm,
- matrix_mdev->apm_add, AP_DEVICES);
- filter_domains = bitmap_and(aqm, matrix_mdev->matrix.aqm,
- matrix_mdev->aqm_add, AP_DOMAINS);
-
- if (filter_adapters && filter_domains)
- do_hotplug |= vfio_ap_mdev_filter_matrix(apm, aqm, matrix_mdev);
- else if (filter_adapters)
- do_hotplug |=
- vfio_ap_mdev_filter_matrix(apm,
- matrix_mdev->shadow_apcb.aqm,
- matrix_mdev);
- else
- do_hotplug |=
- vfio_ap_mdev_filter_matrix(matrix_mdev->shadow_apcb.apm,
- aqm, matrix_mdev);
+ filter_adapters = bitmap_intersects(matrix_mdev->matrix.apm,
+ matrix_mdev->apm_add, AP_DEVICES);
+ filter_domains = bitmap_intersects(matrix_mdev->matrix.aqm,
+ matrix_mdev->aqm_add, AP_DOMAINS);
+ filter_cdoms = bitmap_intersects(matrix_mdev->matrix.adm,
+ matrix_mdev->adm_add, AP_DOMAINS);
+
+ if (filter_adapters || filter_domains)
+ do_hotplug = vfio_ap_mdev_filter_matrix(matrix_mdev);
- if (bitmap_intersects(matrix_mdev->matrix.adm, matrix_mdev->adm_add,
- AP_DOMAINS))
+ if (filter_cdoms)
do_hotplug |= vfio_ap_mdev_filter_cdoms(matrix_mdev);
if (do_hotplug)
--
2.43.0
Check for additional CPUID bits to identify TDX guests running with Trust
Domain (TD) partitioning enabled. TD partitioning is like nested virtualization
inside the Trust Domain, so there is an L1 TD VM(M) and there can be L2 TD VM(s).
In this arrangement we are not guaranteed that the TDX_CPUID_LEAF_ID is visible
to Linux running as an L2 TD VM. This is because a majority of TDX facilities
are controlled by the L1 VMM and the L2 TDX guest needs to use TD partitioning
aware mechanisms for what's left. So currently such guests do not have
X86_FEATURE_TDX_GUEST set.
We want the kernel to have X86_FEATURE_TDX_GUEST set for all TDX guests so we
need to check these additional CPUID bits, but we skip further initialization
in the function as we aren't guaranteed access to TDX module calls.
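The detection order described above can be sketched as a small standalone classifier. This is only an illustration, not the kernel code: struct cpuid_view, classify_guest() and the SK_* constants are hypothetical names for this sketch, the bit values are assumptions, and the signature string is assumed to match TDX_IDENT.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical snapshot of the CPUID values the check looks at. */
struct cpuid_view {
	char     tdx_sig[12];   /* signature from the TDX CPUID leaf */
	uint32_t hv_iso_eax;    /* Hyper-V isolation config leaf, EAX */
	uint32_t hv_iso_ebx;    /* Hyper-V isolation config leaf, EBX */
};

/* Illustrative bit layout; assumptions made for this sketch only. */
#define SK_PARAVISOR_PRESENT  (1u << 0)
#define SK_ISOLATION_TYPE     0xFu
#define SK_ISOLATION_TYPE_TDX 3u

enum tdx_mode { TDX_NONE, TDX_NATIVE, TDX_PARTITIONED };

/*
 * Try the TDX signature first. Only when it is absent, fall back to the
 * hypervisor-provided isolation bits that indicate TD partitioning.
 */
static enum tdx_mode classify_guest(const struct cpuid_view *c)
{
	if (memcmp(c->tdx_sig, "IntelTDX    ", sizeof(c->tdx_sig)) == 0)
		return TDX_NATIVE;

	if ((c->hv_iso_eax & SK_PARAVISOR_PRESENT) &&
	    (c->hv_iso_ebx & SK_ISOLATION_TYPE) == SK_ISOLATION_TYPE_TDX)
		return TDX_PARTITIONED;

	return TDX_NONE;
}
```

In this sketch both TDX_NATIVE and TDX_PARTITIONED would lead to the feature flag being set, while only TDX_NATIVE would continue with the TDX module calls.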
Cc: <stable(a)vger.kernel.org> # v6.5+
Signed-off-by: Jeremi Piotrowski <jpiotrowski(a)linux.microsoft.com>
---
arch/x86/coco/tdx/tdx.c | 29 ++++++++++++++++++++++++++---
arch/x86/include/asm/tdx.h | 3 +++
2 files changed, 29 insertions(+), 3 deletions(-)
diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 1d6b863c42b0..c7bbbaaf654d 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -8,6 +8,7 @@
#include <linux/export.h>
#include <linux/io.h>
#include <asm/coco.h>
+#include <asm/hyperv-tlfs.h>
#include <asm/tdx.h>
#include <asm/vmx.h>
#include <asm/insn.h>
@@ -37,6 +38,8 @@
#define TDREPORT_SUBTYPE_0 0
+bool tdx_partitioning_active;
+
/* Called from __tdx_hypercall() for unrecoverable failure */
noinstr void __tdx_hypercall_failed(void)
{
@@ -757,19 +760,38 @@ static bool tdx_enc_status_change_finish(unsigned long vaddr, int numpages,
return true;
}
+
+static bool early_is_hv_tdx_partitioning(void)
+{
+ u32 eax, ebx, ecx, edx;
+ cpuid(HYPERV_CPUID_ISOLATION_CONFIG, &eax, &ebx, &ecx, &edx);
+ return eax & HV_PARAVISOR_PRESENT &&
+ (ebx & HV_ISOLATION_TYPE) == HV_ISOLATION_TYPE_TDX;
+}
+
void __init tdx_early_init(void)
{
u64 cc_mask;
u32 eax, sig[3];
cpuid_count(TDX_CPUID_LEAF_ID, 0, &eax, &sig[0], &sig[2], &sig[1]);
-
- if (memcmp(TDX_IDENT, sig, sizeof(sig)))
- return;
+ if (memcmp(TDX_IDENT, sig, sizeof(sig))) {
+ tdx_partitioning_active = early_is_hv_tdx_partitioning();
+ if (!tdx_partitioning_active)
+ return;
+ }
setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
cc_vendor = CC_VENDOR_INTEL;
+
+ /*
+ * Need to defer cc_mask and page visibility callback initializations
+ * to a TD-partitioning aware implementation.
+ */
+ if (tdx_partitioning_active)
+ goto exit;
+
tdx_parse_tdinfo(&cc_mask);
cc_set_mask(cc_mask);
@@ -820,5 +842,6 @@ void __init tdx_early_init(void)
*/
x86_cpuinit.parallel_bringup = false;
+exit:
pr_info("Guest detected\n");
}
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 603e6d1e9d4a..fe22f8675859 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -52,6 +52,7 @@ bool tdx_early_handle_ve(struct pt_regs *regs);
int tdx_mcall_get_report0(u8 *reportdata, u8 *tdreport);
+extern bool tdx_partitioning_active;
#else
static inline void tdx_early_init(void) { };
@@ -71,6 +72,8 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
{
return -ENODEV;
}
+
+#define tdx_partitioning_active false
#endif /* CONFIG_INTEL_TDX_GUEST && CONFIG_KVM_GUEST */
#endif /* !__ASSEMBLY__ */
#endif /* _ASM_X86_TDX_H */
--
2.39.2
I'm announcing the release of the 5.10.203 kernel.
All users of the 5.10 kernel series must upgrade.
The updated 5.10.y git tree can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git linux-5.10.y
and can be browsed at the normal kernel.org git web browser:
https://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=summary
thanks,
greg k-h
------------
Documentation/ABI/testing/sysfs-bus-usb | 11
Documentation/ABI/testing/sysfs-devices-removable | 17 +
Makefile | 2
arch/arm/xen/enlighten.c | 3
arch/mips/kvm/mmu.c | 3
arch/parisc/include/uapi/asm/errno.h | 2
arch/powerpc/kernel/fpu.S | 13
arch/powerpc/kernel/vector.S | 2
arch/s390/mm/page-states.c | 14
drivers/acpi/resource.c | 7
drivers/ata/pata_isapnp.c | 3
drivers/base/core.c | 28 +
drivers/base/dd.c | 4
drivers/cpufreq/imx6q-cpufreq.c | 32 +
drivers/firewire/core-device.c | 11
drivers/gpu/drm/amd/amdgpu/amdgpu_bios.c | 5
drivers/gpu/drm/panel/panel-boe-tv101wum-nl6.c | 7
drivers/gpu/drm/panel/panel-simple.c | 13
drivers/gpu/drm/rockchip/rockchip_drm_vop.c | 14
drivers/hid/hid-core.c | 16
drivers/hid/hid-debug.c | 3
drivers/infiniband/hw/i40iw/i40iw_ctrl.c | 6
drivers/infiniband/hw/i40iw/i40iw_type.h | 2
drivers/infiniband/hw/i40iw/i40iw_verbs.c | 10
drivers/input/joystick/xpad.c | 2
drivers/iommu/intel/iommu.c | 2
drivers/md/bcache/btree.c | 6
drivers/md/bcache/sysfs.c | 2
drivers/md/bcache/writeback.c | 22 +
drivers/md/dm-delay.c | 17 -
drivers/md/dm-verity-fec.c | 3
drivers/md/dm-verity-target.c | 4
drivers/md/dm-verity.h | 6
drivers/media/i2c/smiapp/smiapp-core.c | 2
drivers/misc/pci_endpoint_test.c | 12
drivers/mmc/core/block.c | 2
drivers/mmc/core/core.c | 15
drivers/mmc/core/regulator.c | 41 ++
drivers/mmc/host/cqhci.c | 44 +-
drivers/mmc/host/sdhci-sprd.c | 25 +
drivers/net/ethernet/amd/xgbe/xgbe-drv.c | 14
drivers/net/ethernet/amd/xgbe/xgbe-ethtool.c | 11
drivers/net/ethernet/amd/xgbe/xgbe-mdio.c | 14
drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c | 8
drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h | 2
drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c | 7
drivers/net/ethernet/realtek/r8169_main.c | 23 +
drivers/net/ethernet/renesas/ravb_main.c | 20 -
drivers/net/ethernet/stmicro/stmmac/mmc_core.c | 4
drivers/net/ethernet/xilinx/xilinx_axienet_main.c | 2
drivers/net/hyperv/netvsc_drv.c | 66 ++--
drivers/net/usb/ax88179_178a.c | 4
drivers/net/wireguard/device.c | 4
drivers/net/wireguard/receive.c | 12
drivers/net/wireguard/send.c | 3
drivers/nvme/target/core.c | 21 -
drivers/nvme/target/fabrics-cmd.c | 15
drivers/nvme/target/nvmet.h | 5
drivers/pci/controller/dwc/pci-keystone.c | 8
drivers/pinctrl/core.c | 6
drivers/s390/block/dasd.c | 24 -
drivers/scsi/qla2xxx/qla_os.c | 14
drivers/usb/core/config.c | 85 ++---
drivers/usb/core/hub.c | 13
drivers/usb/core/sysfs.c | 24 -
drivers/usb/dwc2/hcd_intr.c | 15
drivers/usb/dwc3/core.c | 2
drivers/usb/dwc3/drd.c | 2
drivers/usb/dwc3/dwc3-qcom.c | 52 ++-
drivers/usb/serial/option.c | 11
drivers/video/fbdev/sticore.h | 2
drivers/xen/swiotlb-xen.c | 1
fs/afs/dynroot.c | 4
fs/afs/internal.h | 1
fs/afs/server_list.c | 2
fs/afs/super.c | 2
fs/afs/vl_rotate.c | 10
fs/btrfs/disk-io.c | 1
fs/btrfs/ref-verify.c | 2
fs/btrfs/send.c | 2
fs/btrfs/super.c | 5
fs/btrfs/volumes.c | 9
fs/cifs/cifsfs.c | 1
fs/cifs/xattr.c | 5
fs/ext4/extents_status.c | 306 +++++++++++++------
fs/inode.c | 16
fs/nfsd/vfs.c | 12
include/linux/device.h | 37 ++
include/linux/fs.h | 45 ++
include/linux/hid.h | 5
include/linux/mmc/host.h | 3
include/linux/platform_data/x86/soc.h | 65 ++++
include/linux/usb.h | 7
include/linux/workqueue.h | 1
include/scsi/scsi_cmnd.h | 6
io_uring/io_uring.c | 2
kernel/locking/lockdep.c | 3
kernel/workqueue.c | 9
lib/errname.c | 6
net/ipv4/igmp.c | 6
net/ipv4/route.c | 2
net/smc/af_smc.c | 8
security/integrity/iint.c | 48 ++
sound/pci/hda/hda_intel.c | 2
sound/pci/hda/patch_realtek.c | 12
sound/soc/generic/simple-card.c | 6
sound/soc/intel/common/soc-intel-quirks.h | 51 ---
sound/soc/sof/sof-pci-dev.c | 62 +++
tools/arch/parisc/include/uapi/asm/errno.h | 2
tools/testing/selftests/net/ipsec.c | 4
tools/testing/selftests/net/mptcp/mptcp_connect.c | 11
111 files changed, 1196 insertions(+), 512 deletions(-)
Abdul Halim, Mohd Syazwan (1):
iommu/vt-d: Add MTL to quirk list to skip TE disabling
Adrian Hunter (5):
mmc: block: Do not lose cache flush during CQE error recovery
mmc: cqhci: Increase recovery halt timeout
mmc: cqhci: Warn of halt or task clear failure
mmc: cqhci: Fix task clearing in CQE error recovery
mmc: block: Retry commands in CQE error recovery
Al Viro (1):
nfsd: lock_rename() needs both directories to live on the same fs
Alan Stern (1):
USB: core: Change configuration warnings to notices
Alex Deucher (1):
drm/amdgpu: don't use ATRM for external devices
Alexander Gordeev (1):
s390/mm: fix phys vs virt confusion in mark_kernel_pXd() functions family
Alexander Stein (1):
usb: dwc3: Fix default mode initialization
Amir Goldstein (1):
ima: annotate iint mutex to avoid lockdep false positive warnings
Andrey Grodzovsky (1):
Revert "workqueue: remove unused cancel_work()"
Asuna Yang (1):
USB: serial: option: add Luat Air72*U series products
Baokun Li (8):
ext4: add a new helper to check if es must be kept
ext4: factor out __es_alloc_extent() and __es_free_extent()
ext4: use pre-allocated es in __es_insert_extent()
ext4: use pre-allocated es in __es_remove_extent()
ext4: using nofail preallocation in ext4_es_remove_extent()
ext4: using nofail preallocation in ext4_es_insert_delayed_block()
ext4: using nofail preallocation in ext4_es_insert_extent()
ext4: fix slab-use-after-free in ext4_es_insert_extent()
Bart Van Assche (2):
scsi: core: Introduce the scsi_cmd_to_rq() function
scsi: qla2xxx: Use scsi_cmd_to_rq() instead of scsi_cmnd.request
Benjamin Tissoires (1):
HID: core: store the unique system identifier in hid_device
Bragatheswaran Manickavel (1):
btrfs: ref-verify: fix memory leaks in btrfs_ref_tree_mod()
Chaitanya Kulkarni (1):
nvmet: remove unnecessary ctrl parameter
Charles Yi (1):
HID: fix HID device resource race between HID core and debugging support
Chen Ni (1):
ata: pata_isapnp: Add missing error check for devm_ioport_map()
Christoph Hellwig (1):
nvmet: nul-terminate the NQNs passed in the connect command
Christoph Niedermaier (2):
cpufreq: imx6q: don't warn for disabling a non-existing frequency
cpufreq: imx6q: Don't disable 792 Mhz OPP unnecessarily
Christopher Bednarz (1):
RDMA/irdma: Prevent zero-length STAG registration
Claudiu Beznea (2):
net: ravb: Use pm_runtime_resume_and_get()
net: ravb: Start TX queues after HW initialization succeeded
Coly Li (2):
bcache: replace a mistaken IS_ERR() by IS_ERR_OR_NULL() in btree_gc_coalesce()
bcache: check return value from btree_node_alloc_replacement()
D. Wythe (1):
net/smc: avoid data corruption caused by decline
David Howells (4):
afs: Fix afs_server_list to be cleaned up with RCU
afs: Make error on cell lookup failure consistent with OpenAFS
afs: Return ENOENT if no cell DNS record can be found
afs: Fix file locking on R/O volumes to operate in local mode
Eric Dumazet (1):
wireguard: use DEV_STATS_INC()
Filipe Manana (2):
btrfs: fix off-by-one when checking chunk map includes logical address
btrfs: make error messages more clear when getting a chunk map
Furong Xu (1):
net: stmmac: xgmac: Disable FPE MMC interrupts
Geetha sowjanya (1):
octeontx2-pf: Fix adding mbox work queue entry when num_vfs > 64
Greg Kroah-Hartman (1):
Linux 5.10.203
Haiyang Zhang (2):
hv_netvsc: Fix race of register_netdevice_notifier and VF register
hv_netvsc: fix race of netvsc and VF register_netdevice
Hans de Goede (2):
ACPI: resource: Skip IRQ override on ASUS ExpertBook B1402CVA
ASoC: Intel: Move soc_intel_is_foo() helpers to a generic header
Heiko Carstens (1):
s390/cmma: fix detection of DAT pages
Heiner Kallweit (4):
r8169: prevent potential deadlock in rtl8169_close
mmc: core: add helpers mmc_regulator_enable/disable_vqmmc
r8169: disable ASPM in case of tx timeout
r8169: fix deadlock on RTL8125 in jumbo mtu mode
Helge Deller (2):
parisc: Drop the HP-UX ENOSYM and EREMOTERELEASE error codes
fbdev: stifb: Make the STI next font pointer a 32-bit signed offset
Huacai Chen (1):
MIPS: KVM: Fix a build warning about variable set but not used
Ioana Ciornei (1):
dpaa2-eth: increase the needed headroom to account for alignment
Jan Höppner (1):
s390/dasd: protect device queue against concurrent access
Jann Horn (1):
btrfs: send: ensure send_fd is writable
Jeff Layton (1):
fs: add ctime accessors infrastructure
Johan Hovold (3):
USB: dwc3: qcom: fix resource leaks on probe deferral
USB: dwc3: qcom: fix ACPI platform device leak
USB: dwc3: qcom: fix wakeup after probe deferral
Jonas Karlman (1):
drm/rockchip: vop: Fix color for RGB888/BGR888 format on VOP full
Jose Ignacio Tornos Martinez (1):
net: usb: ax88179_178a: fix failed operations during ax88179_reset
Kailang Yang (2):
ALSA: hda/realtek: Headset Mic VREF to 100%
ALSA: hda/realtek: Add supported ALC257 for ChromeOS
Keith Busch (2):
swiotlb-xen: provide the "max_mapping_size" method
io_uring: fix off-by one bvec index
Kishon Vijay Abraham I (1):
misc: pci_endpoint_test: Add deviceID for AM64 and J7200
Kuninori Morimoto (1):
ASoC: simple-card: fixup asoc_simple_probe() error handling
Kunwu Chan (1):
ipv4: Correct/silence an endian warning in __ip_do_redirect
Lech Perczak (1):
USB: serial: option: don't claim interface 4 for ZTE MF290
Long Li (1):
hv_netvsc: Mark VF as slave before exposing it to user-mode
Marek Vasut (2):
drm/panel: simple: Fix Innolux G101ICE-L01 bus flags
drm/panel: simple: Fix Innolux G101ICE-L01 timings
Maria Yu (1):
pinctrl: avoid reload of p state in list iteration
Mark Hasemeyer (1):
ASoC: SOF: sof-pci-dev: Fix community key quirk detection
Markus Weippert (1):
bcache: revert replacing IS_ERR_OR_NULL with IS_ERR
Max Nguyen (1):
Input: xpad - add HyperX Clutch Gladiate Support
Mikulas Patocka (2):
dm-delay: fix a race between delay_presuspend and delay_bio
dm-verity: align struct dm_verity_fec_io properly
Mingzhe Zou (3):
bcache: fixup multi-threaded bch_sectors_dirty_init() wake-up race
bcache: fixup init dirty data errors
bcache: fixup lock c->root error
Nathan Chancellor (1):
PCI: keystone: Drop __init from ks_pcie_add_pcie_{ep,port}()
Niklas Neronin (1):
usb: config: fix iteration issue in 'usb_get_bos_descriptor()'
Oliver Neukum (1):
USB: dwc2: write HCINT with INTMASK applied
Peter Zijlstra (1):
lockdep: Fix block chain corruption
Pierre-Louis Bossart (3):
ASoC: SOF: sof-pci-dev: use community key on all Up boards
ASoC: SOF: sof-pci-dev: add parameter to override topology filename
ASoC: SOF: sof-pci-dev: don't use the community key on APL Chromebooks
Puliang Lu (1):
USB: serial: option: fix FM101R-GL defines
Qu Wenruo (1):
btrfs: add dmesg output for first mount and last unmount of a filesystem
Quinn Tran (1):
scsi: qla2xxx: Fix system crash due to bad pointer access
Rajat Jain (1):
driver core: Move the "removable" attribute from USB to core
Raju Rangoju (3):
amd-xgbe: handle corner-case during sfp hotplug
amd-xgbe: handle the corner-case during tx completion
amd-xgbe: propagate the correct speed and duplex status
Rand Deeb (1):
bcache: prevent potential division by zero error
Ricardo Ribalda (1):
usb: dwc3: set the dma max_seg_size
Sakari Ailus (1):
media: ccs: Correctly initialise try compose rectangle
Samuel Holland (1):
net: axienet: Fix check for partial TX checksum
Saravana Kannan (1):
driver core: Release all resources during unbind before updating device links
Shuijing Li (1):
drm/panel: boe-tv101wum-nl6: Fine tune the panel power sequence
Siddharth Vadapalli (1):
misc: pci_endpoint_test: Add deviceID for J721S2 PCIe EP device support
Stefano Stabellini (1):
arm/xen: fix xen_vcpu_info allocation alignment
Steve French (2):
smb3: fix touch -h of symlink
smb3: fix caching of ctime on setxattr
Takashi Iwai (1):
ALSA: hda: Disable power-save on KONTRON SinglePC
Timothy Pearson (1):
powerpc: Don't clobber f0/vs0 during fp|altivec register save
Victor Fragoso (1):
USB: serial: option: add Fibocom L7xx modules
Wenchao Chen (1):
mmc: sdhci-sprd: Fix vqmmc not shutting down after the card was pulled
Willem de Bruijn (2):
selftests/net: ipsec: fix constant out of range
selftests/net: mptcp: fix uninitialized variable warnings
Wu Bo (1):
dm verity: don't perform FEC for failed readahead IO
Xuxin Xiong (1):
drm/panel: auo,b101uan08.3: Fine tune the panel power sequence
Yang Yingliang (1):
firewire: core: fix possible memory leak in create_units()
Yoshihiro Shimoda (1):
ravb: Fix races between ravb_tx_timeout_work() and net related ops
Zhang Yi (1):
ext4: make sure allocate pending entry not fail
Zheng Yongjun (1):
mmc: core: convert comma to semicolon
Zhengchao Shao (1):
ipv4: igmp: fix refcnt uaf issue when receiving igmp query packet
I'm announcing the release of the 5.4.263 kernel.
All users of the 5.4 kernel series must upgrade.
The updated 5.4.y git tree can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git linux-5.4.y
and can be browsed at the normal kernel.org git web browser:
https://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=summary
thanks,
greg k-h
------------
Makefile | 2
arch/arm/xen/enlighten.c | 3
arch/arm64/include/asm/cpufeature.h | 23 +
arch/arm64/include/asm/sysreg.h | 6
arch/arm64/kvm/sys_regs.c | 10
arch/mips/kvm/mmu.c | 3
arch/powerpc/kernel/fpu.S | 13
arch/powerpc/kernel/vector.S | 2
arch/s390/mm/page-states.c | 14 -
drivers/acpi/resource.c | 7
drivers/ata/pata_isapnp.c | 3
drivers/base/dd.c | 4
drivers/cpufreq/imx6q-cpufreq.c | 32 +-
drivers/firewire/core-device.c | 11
drivers/gpu/drm/panel/panel-simple.c | 13
drivers/gpu/drm/rockchip/rockchip_drm_vop.c | 14 -
drivers/hid/hid-core.c | 16 -
drivers/hid/hid-debug.c | 3
drivers/infiniband/hw/i40iw/i40iw_ctrl.c | 6
drivers/infiniband/hw/i40iw/i40iw_type.h | 2
drivers/infiniband/hw/i40iw/i40iw_verbs.c | 10
drivers/input/joystick/xpad.c | 2
drivers/md/bcache/btree.c | 6
drivers/md/bcache/sysfs.c | 2
drivers/md/dm-delay.c | 17 -
drivers/md/dm-verity-fec.c | 3
drivers/md/dm-verity-target.c | 4
drivers/md/dm-verity.h | 6
drivers/mmc/core/block.c | 2
drivers/mmc/core/core.c | 15 -
drivers/mmc/host/cqhci.c | 44 +--
drivers/mtd/chips/cfi_cmdset_0001.c | 29 +-
drivers/net/ethernet/amd/xgbe/xgbe-drv.c | 14 +
drivers/net/ethernet/amd/xgbe/xgbe-ethtool.c | 11
drivers/net/ethernet/amd/xgbe/xgbe-mdio.c | 14 -
drivers/net/ethernet/renesas/ravb_main.c | 20 +
drivers/net/ethernet/stmicro/stmmac/mmc_core.c | 4
drivers/net/ethernet/xilinx/xilinx_axienet_main.c | 2
drivers/net/hyperv/netvsc_drv.c | 41 ++
drivers/net/usb/ax88179_178a.c | 4
drivers/nvme/target/core.c | 21 -
drivers/nvme/target/fabrics-cmd.c | 15 -
drivers/nvme/target/nvmet.h | 5
drivers/pci/controller/dwc/pci-keystone.c | 8
drivers/pinctrl/core.c | 6
drivers/s390/block/dasd.c | 24 -
drivers/scsi/qla2xxx/qla_def.h | 3
drivers/scsi/qla2xxx/qla_isr.c | 5
drivers/scsi/qla2xxx/qla_os.c | 39 +-
drivers/usb/dwc2/hcd_intr.c | 15 -
drivers/usb/dwc3/core.c | 2
drivers/usb/dwc3/dwc3-qcom.c | 17 -
drivers/usb/serial/option.c | 11
drivers/video/fbdev/sticore.h | 2
fs/afs/dynroot.c | 4
fs/afs/super.c | 2
fs/afs/vl_rotate.c | 10
fs/btrfs/disk-io.c | 1
fs/btrfs/send.c | 2
fs/btrfs/super.c | 5
fs/btrfs/volumes.c | 9
fs/cifs/cifsfs.c | 1
fs/ext4/extents_status.c | 306 +++++++++++++++-------
fs/io_uring.c | 2
fs/overlayfs/super.c | 5
fs/sync.c | 3
include/linux/fs.h | 2
include/linux/hid.h | 5
include/scsi/scsi_cmnd.h | 6
net/ipv4/igmp.c | 6
net/ipv4/route.c | 2
security/integrity/iint.c | 48 ++-
security/integrity/ima/ima_api.c | 5
security/integrity/ima/ima_main.c | 16 +
security/integrity/integrity.h | 2
sound/pci/hda/hda_intel.c | 2
sound/pci/hda/patch_realtek.c | 12
77 files changed, 752 insertions(+), 314 deletions(-)
Adrian Hunter (5):
mmc: block: Do not lose cache flush during CQE error recovery
mmc: cqhci: Increase recovery halt timeout
mmc: cqhci: Warn of halt or task clear failure
mmc: cqhci: Fix task clearing in CQE error recovery
mmc: block: Retry commands in CQE error recovery
Alexander Gordeev (1):
s390/mm: fix phys vs virt confusion in mark_kernel_pXd() functions family
Amir Goldstein (1):
ima: annotate iint mutex to avoid lockdep false positive warnings
Andrew Murray (2):
arm64: cpufeature: Extract capped perfmon fields
KVM: arm64: limit PMU version to PMUv3 for ARMv8.1
Asuna Yang (1):
USB: serial: option: add Luat Air72*U series products
Baokun Li (8):
ext4: add a new helper to check if es must be kept
ext4: factor out __es_alloc_extent() and __es_free_extent()
ext4: use pre-allocated es in __es_insert_extent()
ext4: use pre-allocated es in __es_remove_extent()
ext4: using nofail preallocation in ext4_es_remove_extent()
ext4: using nofail preallocation in ext4_es_insert_delayed_block()
ext4: using nofail preallocation in ext4_es_insert_extent()
ext4: fix slab-use-after-free in ext4_es_insert_extent()
Bart Van Assche (3):
scsi: qla2xxx: Simplify the code for aborting SCSI commands
scsi: core: Introduce the scsi_cmd_to_rq() function
scsi: qla2xxx: Use scsi_cmd_to_rq() instead of scsi_cmnd.request
Benjamin Tissoires (1):
HID: core: store the unique system identifier in hid_device
Chaitanya Kulkarni (1):
nvmet: remove unnecessary ctrl parameter
Charles Yi (1):
HID: fix HID device resource race between HID core and debugging support
Chen Ni (1):
ata: pata_isapnp: Add missing error check for devm_ioport_map()
Christoph Hellwig (1):
nvmet: nul-terminate the NQNs passed in the connect command
Christoph Niedermaier (2):
cpufreq: imx6q: don't warn for disabling a non-existing frequency
cpufreq: imx6q: Don't disable 792 Mhz OPP unnecessarily
Christopher Bednarz (1):
RDMA/irdma: Prevent zero-length STAG registration
Claudiu Beznea (2):
net: ravb: Use pm_runtime_resume_and_get()
net: ravb: Start TX queues after HW initialization succeeded
Coly Li (2):
bcache: replace a mistaken IS_ERR() by IS_ERR_OR_NULL() in btree_gc_coalesce()
bcache: check return value from btree_node_alloc_replacement()
David Howells (3):
afs: Make error on cell lookup failure consistent with OpenAFS
afs: Return ENOENT if no cell DNS record can be found
afs: Fix file locking on R/O volumes to operate in local mode
Filipe Manana (2):
btrfs: fix off-by-one when checking chunk map includes logical address
btrfs: make error messages more clear when getting a chunk map
Furong Xu (1):
net: stmmac: xgmac: Disable FPE MMC interrupts
Greg Kroah-Hartman (1):
Linux 5.4.263
Haiyang Zhang (1):
hv_netvsc: Fix race of register_netdevice_notifier and VF register
Hans de Goede (1):
ACPI: resource: Skip IRQ override on ASUS ExpertBook B1402CVA
Heiko Carstens (1):
s390/cmma: fix detection of DAT pages
Helge Deller (1):
fbdev: stifb: Make the STI next font pointer a 32-bit signed offset
Huacai Chen (1):
MIPS: KVM: Fix a build warning about variable set but not used
Jan Höppner (1):
s390/dasd: protect device queue against concurrent access
Jann Horn (1):
btrfs: send: ensure send_fd is writable
Jean-Philippe Brucker (1):
mtd: cfi_cmdset_0001: Support the absence of protection registers
Johan Hovold (2):
USB: dwc3: qcom: fix resource leaks on probe deferral
USB: dwc3: qcom: fix wakeup after probe deferral
Jonas Karlman (1):
drm/rockchip: vop: Fix color for RGB888/BGR888 format on VOP full
Jose Ignacio Tornos Martinez (1):
net: usb: ax88179_178a: fix failed operations during ax88179_reset
Kailang Yang (2):
ALSA: hda/realtek: Headset Mic VREF to 100%
ALSA: hda/realtek: Add supported ALC257 for ChromeOS
Keith Busch (1):
io_uring: fix off-by one bvec index
Konstantin Khlebnikov (1):
ovl: skip overlayfs superblocks at global sync
Kunwu Chan (1):
ipv4: Correct/silence an endian warning in __ip_do_redirect
Lech Perczak (1):
USB: serial: option: don't claim interface 4 for ZTE MF290
Linus Walleij (1):
mtd: cfi_cmdset_0001: Byte swap OTP info
Long Li (1):
hv_netvsc: Mark VF as slave before exposing it to user-mode
Marek Vasut (2):
drm/panel: simple: Fix Innolux G101ICE-L01 bus flags
drm/panel: simple: Fix Innolux G101ICE-L01 timings
Maria Yu (1):
pinctrl: avoid reload of p state in list iteration
Markus Weippert (1):
bcache: revert replacing IS_ERR_OR_NULL with IS_ERR
Max Nguyen (1):
Input: xpad - add HyperX Clutch Gladiate Support
Mikulas Patocka (2):
dm-delay: fix a race between delay_presuspend and delay_bio
dm-verity: align struct dm_verity_fec_io properly
Mimi Zohar (1):
ima: detect changes to the backing overlay file
Nathan Chancellor (1):
PCI: keystone: Drop __init from ks_pcie_add_pcie_{ep,port}()
Oliver Neukum (1):
USB: dwc2: write HCINT with INTMASK applied
Puliang Lu (1):
USB: serial: option: fix FM101R-GL defines
Qu Wenruo (1):
btrfs: add dmesg output for first mount and last unmount of a filesystem
Quinn Tran (1):
scsi: qla2xxx: Fix system crash due to bad pointer access
Raju Rangoju (3):
amd-xgbe: handle corner-case during sfp hotplug
amd-xgbe: handle the corner-case during tx completion
amd-xgbe: propagate the correct speed and duplex status
Rand Deeb (1):
bcache: prevent potential division by zero error
Ricardo Ribalda (1):
usb: dwc3: set the dma max_seg_size
Samuel Holland (1):
net: axienet: Fix check for partial TX checksum
Saravana Kannan (1):
driver core: Release all resources during unbind before updating device links
Stefano Stabellini (1):
arm/xen: fix xen_vcpu_info allocation alignment
Steve French (1):
smb3: fix touch -h of symlink
Takashi Iwai (1):
ALSA: hda: Disable power-save on KONTRON SinglePC
Timothy Pearson (1):
powerpc: Don't clobber f0/vs0 during fp|altivec register save
Victor Fragoso (1):
USB: serial: option: add Fibocom L7xx modules
Wu Bo (1):
dm verity: don't perform FEC for failed readahead IO
Yang Yingliang (1):
firewire: core: fix possible memory leak in create_units()
Yoshihiro Shimoda (1):
ravb: Fix races between ravb_tx_timeout_work() and net related ops
Zhang Yi (1):
ext4: make sure allocate pending entry not fail
Zheng Yongjun (1):
mmc: core: convert comma to semicolon
Zhengchao Shao (1):
ipv4: igmp: fix refcnt uaf issue when receiving igmp query packet
commit 1aa3aaf8953c84bad398adf6c3cabc9d6685bf7d upstream
A transaction complete work is allocated and queued for each
transaction. Under certain conditions the work->type might be marked as
BINDER_WORK_TRANSACTION_ONEWAY_SPAM_SUSPECT to notify userspace about
potential spamming threads or as BINDER_WORK_TRANSACTION_PENDING when
the target is currently frozen.
However, these work types are not being handled in binder_release_work()
so they will leak during a cleanup. This was reported by syzkaller with
the following kmemleak dump:
BUG: memory leak
unreferenced object 0xffff88810e2d6de0 (size 32):
comm "syz-executor338", pid 5046, jiffies 4294968230 (age 13.590s)
hex dump (first 32 bytes):
e0 6d 2d 0e 81 88 ff ff e0 6d 2d 0e 81 88 ff ff .m-......m-.....
04 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<ffffffff81573b75>] kmalloc_trace+0x25/0x90 mm/slab_common.c:1114
[<ffffffff83d41873>] kmalloc include/linux/slab.h:599 [inline]
[<ffffffff83d41873>] kzalloc include/linux/slab.h:720 [inline]
[<ffffffff83d41873>] binder_transaction+0x573/0x4050 drivers/android/binder.c:3152
[<ffffffff83d45a05>] binder_thread_write+0x6b5/0x1860 drivers/android/binder.c:4010
[<ffffffff83d486dc>] binder_ioctl_write_read drivers/android/binder.c:5066 [inline]
[<ffffffff83d486dc>] binder_ioctl+0x1b2c/0x3cf0 drivers/android/binder.c:5352
[<ffffffff816b25f2>] vfs_ioctl fs/ioctl.c:51 [inline]
[<ffffffff816b25f2>] __do_sys_ioctl fs/ioctl.c:871 [inline]
[<ffffffff816b25f2>] __se_sys_ioctl fs/ioctl.c:857 [inline]
[<ffffffff816b25f2>] __x64_sys_ioctl+0xf2/0x140 fs/ioctl.c:857
[<ffffffff84b30008>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
[<ffffffff84b30008>] do_syscall_64+0x38/0xb0 arch/x86/entry/common.c:80
[<ffffffff84c0008b>] entry_SYSCALL_64_after_hwframe+0x63/0xcd
Fix the leaks by kfreeing these work types in binder_release_work() and
handle them as a BINDER_WORK_TRANSACTION_COMPLETE cleanup.
Cc: stable(a)vger.kernel.org
Fixes: a7dc1e6f99df ("binder: tell userspace to dump current backtrace when detected oneway spamming")
Reported-by: syzbot+7f10c1653e35933c0f1e(a)syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=7f10c1653e35933c0f1e
Suggested-by: Alice Ryhl <aliceryhl(a)google.com>
Signed-off-by: Carlos Llamas <cmllamas(a)google.com>
Reviewed-by: Alice Ryhl <aliceryhl(a)google.com>
Acked-by: Todd Kjos <tkjos(a)google.com>
Link: https://lore.kernel.org/r/20230922175138.230331-1-cmllamas@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
[cmllamas: backport to v5.15 by dropping BINDER_WORK_TRANSACTION_PENDING
as commit 0567461a7a6e is not present. Remove fixes tag accordingly.]
Signed-off-by: Carlos Llamas <cmllamas(a)google.com>
---
drivers/android/binder.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/android/binder.c b/drivers/android/binder.c
index cbbed43baf05..b63322e7e101 100644
--- a/drivers/android/binder.c
+++ b/drivers/android/binder.c
@@ -4620,6 +4620,7 @@ static void binder_release_work(struct binder_proc *proc,
"undelivered TRANSACTION_ERROR: %u\n",
e->cmd);
} break;
+ case BINDER_WORK_TRANSACTION_ONEWAY_SPAM_SUSPECT:
case BINDER_WORK_TRANSACTION_COMPLETE: {
binder_debug(BINDER_DEBUG_DEAD_TRANSACTION,
"undelivered TRANSACTION_COMPLETE\n");
base-commit: 9b91d36ba301db86bbf9e783169f7f6abf2585d8
--
2.43.0.472.g3155946c3a-goog
With VRR, every atomic commit affecting a given display must trigger
a new scanout cycle, so that userspace is able to control the refresh
rate of the display. Before this commit, this was not the case for
atomic commits that only contain cursor plane properties.
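The intended rule can be written down as a small predicate. This is a sketch under assumptions: struct cursor_commit and needs_new_refresh_cycle() are hypothetical names, and the two fields only model the conditions checked in the hunk below.

```c
#include <stdbool.h>

/* Hypothetical summary of a cursor-plane update within one commit. */
struct cursor_commit {
	bool cursor_on_this_crtc;   /* old or new cursor fb is on this CRTC */
	bool legacy_cursor_update;  /* commit came from the legacy cursor path */
};

/*
 * A cursor change starts a new refresh cycle only when it arrives
 * through an atomic commit; legacy cursor updates keep the old
 * behaviour and do not count as a flip.
 */
static bool needs_new_refresh_cycle(const struct cursor_commit *c)
{
	return c->cursor_on_this_crtc && !c->legacy_cursor_update;
}
```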
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3034
Cc: stable(a)vger.kernel.org
Signed-off-by: Xaver Hugl <xaver.hugl(a)gmail.com>
---
drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index b452796fc6d3..b379c859fbef 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -8149,9 +8149,15 @@ static void amdgpu_dm_commit_planes(struct drm_atomic_state *state,
/* Cursor plane is handled after stream updates */
if (plane->type == DRM_PLANE_TYPE_CURSOR) {
if ((fb && crtc == pcrtc) ||
- (old_plane_state->fb && old_plane_state->crtc == pcrtc))
+ (old_plane_state->fb && old_plane_state->crtc == pcrtc)) {
cursor_update = true;
-
+ /*
+ * With atomic modesetting, cursor changes must
+ * also trigger a new refresh period with vrr
+ */
+ if (!state->legacy_cursor_update)
+ pflip_present = true;
+ }
continue;
}
--
2.43.0
From: Ville Syrjälä <ville.syrjala(a)linux.intel.com>
Since the plane_state variable is declared outside the scaler_users
loop in intel_atomic_setup_scalers(), and it's never reset back to
NULL inside the loop we may end up calling intel_atomic_setup_scaler()
with a non-NULL plane state for the pipe scaling case. That is bad
because intel_atomic_setup_scaler() determines whether we are doing
plane scaling or pipe scaling based on plane_state!=NULL. The end
result is that we may miscalculate the scaler mode for pipe scaling.
The hardware becomes somewhat upset if we end up in this situation
when scanning out a planar format on a SDR plane. We end up
programming the pipe scaler into planar mode as well, and the
result is a screenful of garbage.
Fix the situation by making sure we pass the correct plane_state==NULL
when calculating the scaler mode for pipe scaling.
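The scoping problem described above can be shown with a small standalone
sketch. All names here (struct plane_state, is_plane_scaling(), the two
count_* helpers) are hypothetical and not taken from i915; the sketch only
illustrates how a pointer declared outside a loop can carry a value over
from an earlier iteration, and how declaring it inside the loop avoids that.

```c
#include <stddef.h>

/* Hypothetical stand-in for a per-plane state object. */
struct plane_state { int id; };

/*
 * Stand-in for the mode decision: a non-NULL plane state means
 * "plane scaling", NULL means "pipe scaling".
 */
static int is_plane_scaling(const struct plane_state *ps)
{
	return ps != NULL;
}

/* users[i] is non-NULL for a plane user and NULL for the pipe user. */
static int count_plane_scaling_buggy(struct plane_state *users[], int n)
{
	struct plane_state *ps = NULL;	/* declared once, never reset */
	int i, count = 0;

	for (i = 0; i < n; i++) {
		if (users[i])
			ps = users[i];
		/* For a pipe user, ps may still point at the previous plane. */
		count += is_plane_scaling(ps);
	}
	return count;
}

static int count_plane_scaling_fixed(struct plane_state *users[], int n)
{
	int i, count = 0;

	for (i = 0; i < n; i++) {
		struct plane_state *ps = NULL;	/* fresh for every iteration */

		if (users[i])
			ps = users[i];
		count += is_plane_scaling(ps);
	}
	return count;
}
```

With one plane user followed by the pipe user, the buggy variant treats
both as plane scaling, while the fixed variant treats only the first one
that way.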
Cc: stable(a)vger.kernel.org
Signed-off-by: Ville Syrjälä <ville.syrjala(a)linux.intel.com>
---
drivers/gpu/drm/i915/display/skl_scaler.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/i915/display/skl_scaler.c b/drivers/gpu/drm/i915/display/skl_scaler.c
index 1e7c97243fcf..8a934bada624 100644
--- a/drivers/gpu/drm/i915/display/skl_scaler.c
+++ b/drivers/gpu/drm/i915/display/skl_scaler.c
@@ -504,7 +504,6 @@ int intel_atomic_setup_scalers(struct drm_i915_private *dev_priv,
{
struct drm_plane *plane = NULL;
struct intel_plane *intel_plane;
- struct intel_plane_state *plane_state = NULL;
struct intel_crtc_scaler_state *scaler_state =
&crtc_state->scaler_state;
struct drm_atomic_state *drm_state = crtc_state->uapi.state;
@@ -536,6 +535,7 @@ int intel_atomic_setup_scalers(struct drm_i915_private *dev_priv,
/* walkthrough scaler_users bits and start assigning scalers */
for (i = 0; i < sizeof(scaler_state->scaler_users) * 8; i++) {
+ struct intel_plane_state *plane_state = NULL;
int *scaler_id;
const char *name;
int idx, ret;
--
2.41.0
Mark reports that brightness is not restored after Xorg dpms screen blank.
This behavior was introduced by commit d9e865826c20 ("drm/amd/display:
Simplify brightness initialization") which dropped the cached backlight
value in display code, but also removed code for when the default value
read back was less than 1 nit.
Restore this code so that the backlight brightness is restored to the
correct default value in this circumstance.
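As a rough illustration of the restored check, here is a standalone sketch
of the range test. The helper name and the macro names are hypothetical;
the three numeric values are the ones used in the diff, and the unit
(millinits) is inferred from the code comment there rather than stated in
this message.

```c
#include <stdint.h>

/* Values from the diff; the millinit unit is an inference. */
#define BL_MIN_MILLINITS      1000u	/* 1 nit */
#define BL_MAX_MILLINITS      5000000u	/* 5000 nits */
#define BL_FALLBACK_MILLINITS 150000u	/* 150 nits */

/*
 * Hypothetical helper: treat a read-back value outside the plausible
 * range as a bad readback and substitute the fallback level.
 */
static uint32_t sanitize_default_backlight(uint32_t readback)
{
	if (readback < BL_MIN_MILLINITS || readback > BL_MAX_MILLINITS)
		return BL_FALLBACK_MILLINITS;
	return readback;
}
```

A readback of 0 falls below the lower bound and is replaced by the
fallback, which is the case that the earlier simplification no longer
handled.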
Reported-by: Mark Herbert <mark.herbert42(a)gmail.com>
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3031
Cc: stable(a)vger.kernel.org
Cc: Camille Cho <camille.cho(a)amd.com>
Cc: Krunoslav Kovac <krunoslav.kovac(a)amd.com>
Cc: Hamza Mahfooz <hamza.mahfooz(a)amd.com>
Fixes: d9e865826c20 ("drm/amd/display: Simplify brightness initialization")
Signed-off-by: Mario Limonciello <mario.limonciello(a)amd.com>
---
.../amd/display/dc/link/protocols/link_edp_panel_control.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/display/dc/link/protocols/link_edp_panel_control.c b/drivers/gpu/drm/amd/display/dc/link/protocols/link_edp_panel_control.c
index ac0fa88b52a0..bf53a86ea817 100644
--- a/drivers/gpu/drm/amd/display/dc/link/protocols/link_edp_panel_control.c
+++ b/drivers/gpu/drm/amd/display/dc/link/protocols/link_edp_panel_control.c
@@ -287,8 +287,8 @@ bool set_default_brightness_aux(struct dc_link *link)
if (link && link->dpcd_sink_ext_caps.bits.oled == 1) {
if (!read_default_bl_aux(link, &default_backlight))
default_backlight = 150000;
- // if > 5000, it might be wrong readback
- if (default_backlight > 5000000)
+ // if < 1 nits or > 5000, it might be wrong readback
+ if (default_backlight < 1000 || default_backlight > 5000000)
default_backlight = 150000;
return edp_set_backlight_level_nits(link, true,
--
2.34.1
When destroying a vgic, we have rather cumbersome rules about
when slots_lock and config_lock are held, resulting in fun
buglets.
The first port of call is to simplify kvm_vgic_map_resources()
so that there is only one call to kvm_vgic_destroy() instead of
two, with the second only holding half of the locks.
For that, we kill the non-locking primitive and move the call
outside of the locking altogether. This doesn't change anything
(we re-acquire the locks and tear down the whole vgic), and
simplifies the code significantly.
Cc: stable(a)vger.kernel.org
Signed-off-by: Marc Zyngier <maz(a)kernel.org>
---
arch/arm64/kvm/vgic/vgic-init.c | 29 ++++++++++++++---------------
1 file changed, 14 insertions(+), 15 deletions(-)
diff --git a/arch/arm64/kvm/vgic/vgic-init.c b/arch/arm64/kvm/vgic/vgic-init.c
index c8c3cb812783..ad7e86879eb9 100644
--- a/arch/arm64/kvm/vgic/vgic-init.c
+++ b/arch/arm64/kvm/vgic/vgic-init.c
@@ -382,26 +382,24 @@ void kvm_vgic_vcpu_destroy(struct kvm_vcpu *vcpu)
vgic_cpu->rd_iodev.base_addr = VGIC_ADDR_UNDEF;
}
-static void __kvm_vgic_destroy(struct kvm *kvm)
+void kvm_vgic_destroy(struct kvm *kvm)
{
struct kvm_vcpu *vcpu;
unsigned long i;
- lockdep_assert_held(&kvm->arch.config_lock);
+ mutex_lock(&kvm->slots_lock);
vgic_debug_destroy(kvm);
kvm_for_each_vcpu(i, vcpu, kvm)
kvm_vgic_vcpu_destroy(vcpu);
+ mutex_lock(&kvm->arch.config_lock);
+
kvm_vgic_dist_destroy(kvm);
-}
-void kvm_vgic_destroy(struct kvm *kvm)
-{
- mutex_lock(&kvm->arch.config_lock);
- __kvm_vgic_destroy(kvm);
mutex_unlock(&kvm->arch.config_lock);
+ mutex_unlock(&kvm->slots_lock);
}
/**
@@ -469,25 +467,26 @@ int kvm_vgic_map_resources(struct kvm *kvm)
type = VGIC_V3;
}
- if (ret) {
- __kvm_vgic_destroy(kvm);
+ if (ret)
goto out;
- }
+
dist->ready = true;
dist_base = dist->vgic_dist_base;
mutex_unlock(&kvm->arch.config_lock);
ret = vgic_register_dist_iodev(kvm, dist_base, type);
- if (ret) {
+ if (ret)
kvm_err("Unable to register VGIC dist MMIO regions\n");
- kvm_vgic_destroy(kvm);
- }
- mutex_unlock(&kvm->slots_lock);
- return ret;
+ goto out_slots;
out:
mutex_unlock(&kvm->arch.config_lock);
+out_slots:
mutex_unlock(&kvm->slots_lock);
+
+ if (ret)
+ kvm_vgic_destroy(kvm);
+
return ret;
}
--
2.39.2
The rtc on the mox shares its interrupt line with the moxtet bus. Set
the interrupt type to be consistent between both devices. This ensures
correct setup of the interrupt line regardless of probing order.
Signed-off-by: Sjoerd Simons <sjoerd(a)collabora.com>
Cc: stable(a)vger.kernel.org # v6.2+
Fixes: 21aad8ba615e ("arm64: dts: armada-3720-turris-mox: Add missing interrupt for RTC")
---
(no changes since v1)
arch/arm64/boot/dts/marvell/armada-3720-turris-mox.dts | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/arm64/boot/dts/marvell/armada-3720-turris-mox.dts b/arch/arm64/boot/dts/marvell/armada-3720-turris-mox.dts
index 9eab2bb22134..805ef2d79b40 100644
--- a/arch/arm64/boot/dts/marvell/armada-3720-turris-mox.dts
+++ b/arch/arm64/boot/dts/marvell/armada-3720-turris-mox.dts
@@ -130,7 +130,7 @@ rtc@6f {
compatible = "microchip,mcp7940x";
reg = <0x6f>;
interrupt-parent = <&gpiosb>;
- interrupts = <5 0>; /* GPIO2_5 */
+ interrupts = <5 IRQ_TYPE_EDGE_FALLING>; /* GPIO2_5 */
};
};
--
2.43.0
The Turris Mox shares the moxtet IRQ with various devices on the board,
so mark the IRQ as shared in the driver as well.
Without this, loading the module will fail with:
genirq: Flags mismatch irq 40. 00002002 (moxtet) vs. 00002080 (mcp7940x)
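The two flag words in that message can be decoded with the sketch below.
The IRQF_* values match include/linux/interrupt.h; the function
irq_flags_compatible() is a hypothetical, simplified model of the sharing
rules and not the actual genirq code, which handles more cases.

```c
#include <stdbool.h>

/* Flag values as defined in include/linux/interrupt.h. */
#define IRQF_TRIGGER_MASK    0x0000000fu
#define IRQF_TRIGGER_FALLING 0x00000002u
#define IRQF_SHARED          0x00000080u
#define IRQF_ONESHOT         0x00002000u

/*
 * Simplified model (an assumption, not the kernel implementation):
 * two handlers may share a line only if both set IRQF_SHARED, both
 * use the same trigger type, and both agree on IRQF_ONESHOT.
 */
static bool irq_flags_compatible(unsigned int old_flags,
				 unsigned int new_flags)
{
	if (!((old_flags & new_flags) & IRQF_SHARED))
		return false;
	if ((old_flags & IRQF_TRIGGER_MASK) != (new_flags & IRQF_TRIGGER_MASK))
		return false;
	if ((old_flags ^ new_flags) & IRQF_ONESHOT)
		return false;
	return true;
}
```

In this model 0x00002002 decodes to IRQF_ONESHOT | IRQF_TRIGGER_FALLING and
0x00002080 to IRQF_ONESHOT | IRQF_SHARED, so the pair is rejected because
only one side sets IRQF_SHARED.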
Signed-off-by: Sjoerd Simons <sjoerd(a)collabora.com>
Cc: stable(a)vger.kernel.org # v6.2+
---
(no changes since v1)
drivers/bus/moxtet.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/bus/moxtet.c b/drivers/bus/moxtet.c
index 5eb0fe73ddc4..48c18f95660a 100644
--- a/drivers/bus/moxtet.c
+++ b/drivers/bus/moxtet.c
@@ -755,7 +755,7 @@ static int moxtet_irq_setup(struct moxtet *moxtet)
moxtet->irq.masked = ~0;
ret = request_threaded_irq(moxtet->dev_irq, NULL, moxtet_irq_thread_fn,
- IRQF_ONESHOT, "moxtet", moxtet);
+ IRQF_SHARED | IRQF_ONESHOT, "moxtet", moxtet);
if (ret < 0)
goto err_free;
--
2.43.0
When RPMB was converted to a character device, it added support for
multiple RPMB partitions (commit 97548575bef3 ("mmc: block: Convert RPMB
to a character device")).
One of the changes in this commit was transforming the variable
target_part defined in __mmc_blk_ioctl_cmd into a bitmask.
This inadvertently regressed the validation check done in
mmc_blk_part_switch_pre() and mmc_blk_part_switch_post().
This commit fixes that regression.
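To illustrate why an equality test stops matching once the value carries
additional bits, here is a minimal standalone sketch. The value 0x3 for
EXT_CSD_PART_CONFIG_ACC_RPMB is an assumption made for this example (the
kernel defines the constant in include/linux/mmc/mmc.h), the helper names
are hypothetical, and the extra bit 0x10 used below is purely illustrative
and not the actual encoding of target_part.

```c
#include <stdbool.h>

/* Assumed value, for illustration only. */
#define EXT_CSD_PART_CONFIG_ACC_RPMB 0x3u

/* Old-style check: only matches when no other bits are set. */
static bool is_rpmb_equality(unsigned int part_type)
{
	return part_type == EXT_CSD_PART_CONFIG_ACC_RPMB;
}

/* Mask-style check: matches whenever all RPMB access bits are set. */
static bool is_rpmb_masked(unsigned int part_type)
{
	const unsigned int mask = EXT_CSD_PART_CONFIG_ACC_RPMB;

	return (part_type & mask) == mask;
}
```

For a value such as 0x13 (RPMB access bits plus a hypothetical extra bit)
the equality test fails while the masked test still succeeds; for values
that lack one of the RPMB bits, both tests fail.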
Fixes: 97548575bef3 ("mmc: block: Convert RPMB to a character device")
Signed-off-by: Jorge Ramirez-Ortiz <jorge(a)foundries.io>
Reviewed-by: Linus Walleij <linus.walleij(a)linaro.org>
Cc: <stable(a)vger.kernel.org> # v4.14+
---
v2:
fixes parenthesis around condition
v3:
adds stable to commit header
v4:
fixes the stable version to v4.14
adds Reviewed-by
drivers/mmc/core/block.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c
index 152dfe593c43..13093d26bf81 100644
--- a/drivers/mmc/core/block.c
+++ b/drivers/mmc/core/block.c
@@ -851,9 +851,10 @@ static const struct block_device_operations mmc_bdops = {
static int mmc_blk_part_switch_pre(struct mmc_card *card,
unsigned int part_type)
{
+ const unsigned int mask = EXT_CSD_PART_CONFIG_ACC_RPMB;
int ret = 0;
- if (part_type == EXT_CSD_PART_CONFIG_ACC_RPMB) {
+ if ((part_type & mask) == mask) {
if (card->ext_csd.cmdq_en) {
ret = mmc_cmdq_disable(card);
if (ret)
@@ -868,9 +869,10 @@ static int mmc_blk_part_switch_pre(struct mmc_card *card,
static int mmc_blk_part_switch_post(struct mmc_card *card,
unsigned int part_type)
{
+ const unsigned int mask = EXT_CSD_PART_CONFIG_ACC_RPMB;
int ret = 0;
- if (part_type == EXT_CSD_PART_CONFIG_ACC_RPMB) {
+ if ((part_type & mask) == mask) {
mmc_retune_unpause(card->host);
if (card->reenable_cmdq && !card->ext_csd.cmdq_en)
ret = mmc_cmdq_enable(card);
@@ -3143,4 +3145,3 @@ module_exit(mmc_blk_exit);
MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("Multimedia Card (MMC) block device driver");
-
--
2.34.1
Hi,
On 2023-12-01 08:31:48 +0000, Zhang, Rui wrote:
> As a quick fix, I'm not going to fix the "potential issue" described
> above because we have not seen a real problem caused by this yet.
>
> Can you please try the below patch to confirm if the problem is gone on
> your system?
> This patch falls back to the previous way as sent at
> https://lore.kernel.org/lkml/87pm4bp54z.ffs@tglx/T/
I've just spent a couple hours bisecting why upgrading to 6.7-rc4 left me with
just a single CPU core on my dual socket workstation.
before:
[ 0.000000] Linux version 6.6.0-andres-00003-g31255e072b2e ...
...
[ 0.022960] ACPI: Using ACPI (MADT) for SMP configuration information
...
[ 0.022968] smpboot: Allowing 40 CPUs, 0 hotplug CPUs
...
[ 0.345921] smpboot: CPU0: Intel(R) Xeon(R) Gold 5215 CPU @ 2.50GHz (family: 0x6, model: 0x55, stepping: 0x7)
...
[ 0.347229] .... node #0, CPUs: #1 #2 #3 #4 #5 #6 #7 #8 #9
[ 0.349082] .... node #1, CPUs: #10 #11 #12 #13 #14 #15 #16 #17 #18 #19
[ 0.003190] smpboot: CPU 10 Converting physical 0 to logical die 1
[ 0.361053] .... node #0, CPUs: #20 #21 #22 #23 #24 #25 #26 #27 #28 #29
[ 0.363990] .... node #1, CPUs: #30 #31 #32 #33 #34 #35 #36 #37 #38 #39
...
[ 0.370886] smp: Brought up 2 nodes, 40 CPUs
[ 0.370891] smpboot: Max logical packages: 2
[ 0.370896] smpboot: Total of 40 processors activated (200000.00 BogoMIPS)
[ 0.403905] node 0 deferred pages initialised in 32ms
[ 0.408865] node 1 deferred pages initialised in 37ms
after:
[ 0.000000] Linux version 6.6.0-andres-00004-gec9aedb2aa1a ...
...
[ 0.022935] ACPI: Using ACPI (MADT) for SMP configuration information
...
[ 0.022942] smpboot: Allowing 1 CPUs, 0 hotplug CPUs
...
[ 0.356424] smpboot: CPU0: Intel(R) Xeon(R) Gold 5215 CPU @ 2.50GHz (family: 0x6, model: 0x55, stepping: 0x7)
...
[ 0.357098] smp: Bringing up secondary CPUs ...
[ 0.357107] smp: Brought up 2 nodes, 1 CPU
[ 0.357108] smpboot: Max logical packages: 1
[ 0.357110] smpboot: Total of 1 processors activated (5000.00 BogoMIPS)
[ 0.726283] node 0 deferred pages initialised in 368ms
[ 0.774704] node 1 deferred pages initialised in 418ms
There does seem to be something off with the ACPI data: when booting without
the patch, I do see messages like:
[ 0.715228] APIC: NR_CPUS/possible_cpus limit of 40 reached. Processor 40/0x7f00 ignored.
[ 0.715231] ACPI: Unable to map lapic to logical cpu number
But other than that, the system has worked for a couple years.
It's obviously not good to regress from 2x10/20 cores/threads to a single
core. I guess it's at least somewhat funny to imagine a 2 socket system with
a single core...
It seems particularly worrying that this patch has apparently been selected
for -stable:
https://lore.kernel.org/all/20231122153212.852040-2-sashal@kernel.org/
Even if it didn't have these unintended consequences, it seems like a commit
like this hardly is -stable material?
I've attached .config, dmesg of a boot with gec9aedb2aa1a and one with
gec9aedb2aa1a^.
Greetings,
Andres Freund