Good morning,
I am writing to tell you about one of the best GPS tools on the market.
The tool, which I would like to present to you briefly, offers many features useful for your work, optimizing transport processes and helping you carry out field tasks more efficiently.
Would you like to know the details?
Kind regards,
Antonio Valverde
Hello,
I am sorry for contacting you this way, especially since we have never
met before. I urgently seek your service to represent me in investing in
your region/country, and you will be rewarded for your service without
affecting your present job and with very little time invested.
My interest is in buying real estate, private schools, or companies with
potential for rapid long-term growth.
So please confirm your interest by replying.
My dearest regards,
Seyba Daniel
The patch titled
Subject: mm, page_alloc: check pfn is valid before moving to freelist
has been added to the -mm tree. Its filename is
mm-page_alloc-check-pfn-is-valid-before-moving-to-freelist.patch
This patch should soon appear at
https://ozlabs.org/~akpm/mmots/broken-out/mm-page_alloc-check-pfn-is-valid-…
and later at
https://ozlabs.org/~akpm/mmotm/broken-out/mm-page_alloc-check-pfn-is-valid-…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Sudarshan Rajagopalan <quic_sudaraja(a)quicinc.com>
Subject: mm, page_alloc: check pfn is valid before moving to freelist
Check whether the pfn is valid before moving it to the freelist.
There are possible scenarios where a pageblock can consist partly of a
physical hole and partly of System RAM. This happens when a base address
in the RAM partition table is not aligned to the pageblock size.
Example:
Say we have these first two entries in the RAM partition table -
Base Addr: 0x0000000080000000 Length: 0x0000000058000000
Base Addr: 0x00000000E3930000 Length: 0x0000000020000000
...
Physical hole: 0xD8000000 - 0xE3930000
On a system with a 4K page size, and hence a 4MB pageblock size, the
base address 0xE3930000 is not aligned to the 4MB pageblock size.
We now have a pageblock that is partly a physical hole and partly
System RAM -
Pageblock [0xE3800000 - 0xE3C00000] -
0xE3800000 - 0xE3930000 -- physical hole
0xE3930000 - 0xE3C00000 -- System RAM
Now, say that in __alloc_pages we get a valid page with PFN 0xE3B00 from
__rmqueue_fallback; we then try to move the other pages from the same
pageblock into the freelist as well by calling steal_suitable_fallback().
We then search for free pages from the start of the pageblock due to the code below -
move_freepages_block(zone, page, migratetype, ...)
{
	pfn = page_to_pfn(page);
	start_pfn = pfn & ~(pageblock_nr_pages - 1);
	end_pfn = start_pfn + pageblock_nr_pages - 1;
	...
}
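For illustration only, here is a minimal userspace-style sketch (hypothetical,
reusing the example addresses above; pageblock_nr_pages is 1024 for 4K pages
and 4MB pageblocks) of how that masking lands the scan inside the hole:

#include <stdio.h>

int main(void)
{
	unsigned long pageblock_nr_pages = 1024;	/* 4MB / 4KB */
	unsigned long pfn = 0xE3B00;			/* page from __rmqueue_fallback */
	unsigned long start_pfn = pfn & ~(pageblock_nr_pages - 1);
	unsigned long end_pfn = start_pfn + pageblock_nr_pages - 1;

	/*
	 * start_pfn is 0xE3800 (address 0xE3800000), which falls inside the
	 * physical hole 0xD8000000 - 0xE3930000, so the struct pages scanned
	 * from there were never initialized.
	 */
	printf("scan PFNs %#lx - %#lx\n", start_pfn, end_pfn);
	return 0;
}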
With a pageblock that has a partial physical hole at its beginning, we
will run into PFNs from the physical hole whose struct page is not
initialized and is invalid, and the system crashes as we operate on an
invalid struct page to find out whether the page is in the Buddy allocator or on the LRU
[ 107.629453][ T9688] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[ 107.639214][ T9688] Mem abort info:
[ 107.642829][ T9688] ESR = 0x96000006
[ 107.646696][ T9688] EC = 0x25: DABT (current EL), IL = 32 bits
[ 107.652878][ T9688] SET = 0, FnV = 0
[ 107.656751][ T9688] EA = 0, S1PTW = 0
[ 107.660705][ T9688] FSC = 0x06: level 2 translation fault
[ 107.666455][ T9688] Data abort info:
[ 107.670151][ T9688] ISV = 0, ISS = 0x00000006
[ 107.674827][ T9688] CM = 0, WnR = 0
[ 107.678615][ T9688] user pgtable: 4k pages, 39-bit VAs, pgdp=000000098a237000
[ 107.685970][ T9688] [0000000000000000] pgd=0800000987170003, p4d=0800000987170003, pud=0800000987170003, pmd=0000000000000000
[ 107.697582][ T9688] Internal error: Oops: 96000006 [#1] PREEMPT SMP
[ 108.209839][ T9688] pc : move_freepages_block+0x174/0x27c
[ 108.215407][ T9688] lr : steal_suitable_fallback+0x20c/0x398
[ 108.305908][ T9688] Call trace:
[ 108.309151][ T9688] move_freepages_block+0x174/0x27c [PageLRU]
[ 108.314359][ T9688] steal_suitable_fallback+0x20c/0x398
[ 108.319826][ T9688] rmqueue_bulk+0x250/0x934
[ 108.324325][ T9688] rmqueue_pcplist+0x178/0x2ac
[ 108.329086][ T9688] rmqueue+0x5c/0xc10
[ 108.333048][ T9688] get_page_from_freelist+0x19c/0x430
[ 108.338430][ T9688] __alloc_pages+0x134/0x424
[ 108.343017][ T9688] page_cache_ra_unbounded+0x120/0x324
[ 108.348494][ T9688] do_sync_mmap_readahead+0x1b0/0x234
[ 108.353878][ T9688] filemap_fault+0xe0/0x4c8
[ 108.358375][ T9688] do_fault+0x168/0x6cc
[ 108.362518][ T9688] handle_mm_fault+0x5c4/0x848
[ 108.367280][ T9688] do_page_fault+0x3fc/0x5d0
[ 108.371867][ T9688] do_translation_fault+0x6c/0x1b0
[ 108.376985][ T9688] do_mem_abort+0x68/0x10c
[ 108.381389][ T9688] el0_ia+0x50/0xbc
[ 108.385175][ T9688] el0t_32_sync_handler+0x88/0xbc
[ 108.390208][ T9688] el0t_32_sync+0x1b8/0x1bc
Hence, avoid operating on invalid pages within the same pageblock by
checking whether each pfn is valid.
Link: https://lkml.kernel.org/r/fb3c8c008994b2ed96f74b6b9698ff998b689bd2.16497940…
Signed-off-by: Sudarshan Rajagopalan <quic_sudaraja(a)quicinc.com>
Fixes: 859a85ddf90e7140 ("mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE")
Cc: Mike Rapoport <rppt(a)linux.ibm.com>
Cc: Anshuman Khandual <anshuman.khandual(a)arm.com>
Cc: Suren Baghdasaryan <surenb(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
--- a/mm/page_alloc.c~mm-page_alloc-check-pfn-is-valid-before-moving-to-freelist
+++ a/mm/page_alloc.c
@@ -2521,6 +2521,11 @@ static int move_freepages(struct zone *z
int pages_moved = 0;
for (pfn = start_pfn; pfn <= end_pfn;) {
+ if (!pfn_valid(pfn)) {
+ pfn++;
+ continue;
+ }
+
page = pfn_to_page(pfn);
if (!PageBuddy(page)) {
/*
_
Patches currently in -mm which might be from quic_sudaraja(a)quicinc.com are
mm-page_alloc-check-pfn-is-valid-before-moving-to-freelist.patch
While the latent entropy plugin mostly doesn't derive entropy from
get_random_const() for measuring the call graph, when __latent_entropy is
applied to a constant, then it's initialized statically to output from
get_random_const(). In that case, this data is derived from a 64-bit
seed, which means a buffer of 512 bits doesn't really have that amount
of compile-time entropy.
This patch fixes that shortcoming by just buffering chunks of
/dev/urandom output and doling it out as requested.
At the same time, it's important that we don't break the use of
-frandom-seed, for people who want the runtime benefits of the latent
entropy plugin, while still having compile-time determinism. In that
case, we detect whether gcc's set_random_seed() has been called by
making a call to get_random_seed(noinit=true) in the plugin init
function, which is called after set_random_seed() is called but before
anything that calls get_random_seed(noinit=false), and seeing if it's
zero or not. If it's not zero, we're in deterministic mode, and so we
just generate numbers with a basic xorshift prng.
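For reference, a minimal standalone sketch of that deterministic fallback
(the identifiers here are illustrative, not the plugin's actual symbols);
it only behaves sensibly when the seed is non-zero, which matches the
-frandom-seed case described above:

/* Marsaglia xorshift64 step, used only when a -frandom-seed value exists. */
static unsigned long long deterministic_seed;	/* assumed non-zero */

static unsigned long long next_deterministic(void)
{
	unsigned long long w = deterministic_seed;

	w ^= w << 13;
	w ^= w >> 7;
	w ^= w << 17;
	deterministic_seed = w;
	return w;
}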
Fixes: 38addce8b600 ("gcc-plugins: Add latent_entropy plugin")
Cc: stable(a)vger.kernel.org
Cc: PaX Team <pageexec(a)freemail.hu>
Signed-off-by: Jason A. Donenfeld <Jason(a)zx2c4.com>
---
Changes v1->v2:
- Pipacs pointed out that using /dev/urandom unconditionally would break
the use of -frandom-seed, so now we check for that and keep with
something deterministic in that case.
I'm not super familiar with this plugin or its conventions, so pointers
would be most welcome if something here looks amiss. The decision to
buffer 2k at a time is pretty arbitrary too; I haven't measured usage.
scripts/gcc-plugins/latent_entropy_plugin.c | 48 +++++++++++++--------
1 file changed, 30 insertions(+), 18 deletions(-)
diff --git a/scripts/gcc-plugins/latent_entropy_plugin.c b/scripts/gcc-plugins/latent_entropy_plugin.c
index 589454bce930..042442013ae1 100644
--- a/scripts/gcc-plugins/latent_entropy_plugin.c
+++ b/scripts/gcc-plugins/latent_entropy_plugin.c
@@ -82,29 +82,37 @@ __visible int plugin_is_GPL_compatible;
static GTY(()) tree latent_entropy_decl;
static struct plugin_info latent_entropy_plugin_info = {
- .version = "201606141920vanilla",
+ .version = "202203311920vanilla",
.help = "disable\tturn off latent entropy instrumentation\n",
};
-static unsigned HOST_WIDE_INT seed;
-/*
- * get_random_seed() (this is a GCC function) generates the seed.
- * This is a simple random generator without any cryptographic security because
- * the entropy doesn't come from here.
- */
+static unsigned HOST_WIDE_INT deterministic_seed;
+static unsigned HOST_WIDE_INT rnd_buf[256];
+static size_t rnd_idx = ARRAY_SIZE(rnd_buf);
+static int urandom_fd = -1;
+
static unsigned HOST_WIDE_INT get_random_const(void)
{
- unsigned int i;
- unsigned HOST_WIDE_INT ret = 0;
-
- for (i = 0; i < 8 * sizeof(ret); i++) {
- ret = (ret << 1) | (seed & 1);
- seed >>= 1;
- if (ret & 1)
- seed ^= 0xD800000000000000ULL;
+ if (deterministic_seed) {
+ unsigned HOST_WIDE_INT w = deterministic_seed;
+ w ^= w << 13;
+ w ^= w >> 7;
+ w ^= w << 17;
+ deterministic_seed = w;
+ return deterministic_seed;
}
- return ret;
+ if (urandom_fd < 0) {
+ urandom_fd = open("/dev/urandom", O_RDONLY);
+ if (urandom_fd < 0)
+ abort();
+ }
+ if (rnd_idx >= ARRAY_SIZE(rnd_buf)) {
+ if (read(urandom_fd, rnd_buf, sizeof(rnd_buf)) != sizeof(rnd_buf))
+ abort();
+ rnd_idx = 0;
+ }
+ return rnd_buf[rnd_idx++];
}
static tree tree_get_random_const(tree type)
@@ -537,8 +545,6 @@ static void latent_entropy_start_unit(void *gcc_data __unused,
tree type, id;
int quals;
- seed = get_random_seed(false);
-
if (in_lto_p)
return;
@@ -573,6 +579,12 @@ __visible int plugin_init(struct plugin_name_args *plugin_info,
const struct plugin_argument * const argv = plugin_info->argv;
int i;
+ /*
+ * Call get_random_seed() with noinit=true, so that this returns
+ * 0 in the case where no seed has been passed via -frandom-seed.
+ */
+ deterministic_seed = get_random_seed(true);
+
static const struct ggc_root_tab gt_ggc_r_gt_latent_entropy[] = {
{
.base = &latent_entropy_decl,
--
2.35.1
[BUG]
If we hit an error from submit_extent_page() inside
__extent_writepage_io(), we could still return 0 to the caller, and
even trigger the warning in btrfs_page_assert_not_dirty().
[CAUSE]
In __extent_writepage_io(), if we hit an error from
submit_extent_page(), we will just clean up the range and continue.
This is completely fine for the regular PAGE_SIZE == sectorsize case, as we
can only hit one sector in one page; thus after the error we are guaranteed
to exit and @ret will be preserved.
But for the subpage case, we may have other dirty subpage ranges in the page,
and in the next loop iteration we may succeed in submitting the next range.
In that case, @ret will be overwritten, and we return 0 to the caller
even though we hit an error.
[FIX]
Introduce @has_error and @saved_ret to record the first error we hit, so
it is never lost.
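As a generic illustration of that pattern (not the btrfs code itself;
submit_one_range() is a hypothetical helper), the idea is simply
"remember the first error, keep going":

#include <stdbool.h>

extern int submit_one_range(int i);	/* hypothetical per-range submit */

int process_ranges(int nr)
{
	int saved_ret = 0;
	bool has_error = false;
	int i;

	for (i = 0; i < nr; i++) {
		int ret = submit_one_range(i);

		if (ret) {
			has_error = true;
			if (!saved_ret)
				saved_ret = ret;	/* keep the first error */
			continue;			/* later ranges may still succeed */
		}
	}
	return has_error ? saved_ret : 0;
}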
CC: stable(a)vger.kernel.org # 5.15+
Signed-off-by: Qu Wenruo <wqu(a)suse.com>
---
fs/btrfs/extent_io.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 990d8475ba31..df4e78ff3b18 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3955,10 +3955,12 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
u64 extent_offset;
u64 block_start;
struct extent_map *em;
+ int saved_ret = 0;
int ret = 0;
int nr = 0;
u32 opf = REQ_OP_WRITE;
const unsigned int write_flags = wbc_to_write_flags(wbc);
+ bool has_error = false;
bool compressed;
ret = btrfs_writepage_cow_fixup(page);
@@ -4008,6 +4010,9 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
if (IS_ERR(em)) {
btrfs_page_set_error(fs_info, page, cur, end - cur + 1);
ret = PTR_ERR_OR_ZERO(em);
+ has_error = true;
+ if (!saved_ret)
+ saved_ret = ret;
break;
}
@@ -4071,6 +4076,10 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
end_bio_extent_writepage,
0, 0, false);
if (ret) {
+ has_error = true;
+ if (!saved_ret)
+ saved_ret = ret;
+
btrfs_page_set_error(fs_info, page, cur, iosize);
if (PageWriteback(page))
btrfs_page_clear_writeback(fs_info, page, cur,
@@ -4084,8 +4093,10 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
* If we finish without problem, we should not only clear page dirty,
* but also empty subpage dirty bits
*/
- if (!ret)
+ if (!has_error)
btrfs_page_assert_not_dirty(fs_info, page);
+ else
+ ret = saved_ret;
*nr_ret = nr;
return ret;
}
--
2.35.1
[BUG]
Test case generic/475 has a very high chance (almost 100%) of hitting an fs
hang, where a data page is never unlocked and all later operations hang.
[CAUSE]
In btrfs_do_readpage(), if we hit an error from submit_extent_page() we
will try to do the cleanup for our current io range, and exit.
This works fine for PAGE_SIZE == sectorsize cases, but not for subpage.
For subpage btrfs_do_readpage() will lock the full page first, which can
contain several different sectors and extents:
btrfs_do_readpage()
|- begin_page_read()
|  |- btrfs_subpage_start_reader();
|     Now the page will have PAGE_SIZE / sectorsize readers pending,
|     and the page is locked.
|
|- end_page_read() for different branches
|  This function will reduce subpage readers, and when readers
|  reach 0, it will unlock the page.
But when submit_extent_page() fails, we only clean up the current
io range, while the remaining io ranges are never cleaned up, and the
page remains locked forever.
[FIX]
Update the error handling of submit_extent_page() to clean up all the
remaining subpage ranges before exiting the loop.
Please note that submit_extent_page() can now only fail due to the
sanity check in alloc_new_bio(); thus regular IO errors cannot
trigger this error path.
CC: stable(a)vger.kernel.org # 5.15+
Signed-off-by: Qu Wenruo <wqu(a)suse.com>
---
fs/btrfs/extent_io.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 163aa6dee987..990d8475ba31 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3774,8 +3774,12 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
this_bio_flag,
force_bio_submit);
if (ret) {
- unlock_extent(tree, cur, cur + iosize - 1);
- end_page_read(page, false, cur, iosize);
+ /*
+ * We have to unlock the remaining range, or the page
+ * will never be unlocked.
+ */
+ unlock_extent(tree, cur, end);
+ end_page_read(page, false, cur, end + 1 - cur);
goto out;
}
cur = cur + iosize;
--
2.35.1
Checks are presently in place in validate_nla() to ensure the user cannot
pass in strings longer than 2 bytes, which could potentially
cause issues.
However, there is nothing to prevent userspace from providing only a
single (1) byte as the data length parameter via nla_put(). If this
were to happen, it would cause an OOB read in regulatory_hint_user(),
since it assumes that alpha2[0] and alpha2[1] will always be
accessible.
Add an additional check to ensure enough data has been provided to
hold both bytes.
Cc: <stable(a)vger.kernel.org>
Cc: Johannes Berg <johannes(a)sipsolutions.net>
Cc: "David S. Miller" <davem(a)davemloft.net>
Cc: Jakub Kicinski <kuba(a)kernel.org>
Cc: Paolo Abeni <pabeni(a)redhat.com>
Cc: linux-wireless(a)vger.kernel.org
Cc: netdev(a)vger.kernel.org
Signed-off-by: Lee Jones <lee.jones(a)linaro.org>
---
net/wireless/nl80211.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/net/wireless/nl80211.c b/net/wireless/nl80211.c
index ee1c2b6b69711..80a516033db36 100644
--- a/net/wireless/nl80211.c
+++ b/net/wireless/nl80211.c
@@ -7536,6 +7536,10 @@ static int nl80211_req_set_reg(struct sk_buff *skb, struct genl_info *info)
if (!info->attrs[NL80211_ATTR_REG_ALPHA2])
return -EINVAL;
+ if (nla_len(info->attrs[NL80211_ATTR_REG_ALPHA2]) !=
+ nl80211_policy[NL80211_ATTR_REG_ALPHA2].len)
+ return -EINVAL;
+
data = nla_data(info->attrs[NL80211_ATTR_REG_ALPHA2]);
return regulatory_hint_user(data, user_reg_hint_type);
case NL80211_USER_REG_HINT_INDOOR:
--
2.35.1.1094.g7c7d902a7c-goog
From: Lin Ma <linma(a)zju.edu.cn>
commit 0b9111922b1f399aba6ed1e1b8f2079c3da1aed8 upstream.
There is a possible race condition (use-after-free) like below
(USE) | (FREE)
dev_queue_xmit |
__dev_queue_xmit |
__dev_xmit_skb |
sch_direct_xmit | ...
xmit_one |
netdev_start_xmit | tty_ldisc_kill
__netdev_start_xmit | 6pack_close
sp_xmit | kfree
sp_encaps |
|
Following the approach of the patch "defer ax25 kfree after unregister_netdev",
this patch reorders the kfree after unregister_netdev() to avoid the possible
UAF, as unregister_netdev() is well synchronized and won't return while
a routine is still running.
Signed-off-by: Lin Ma <linma(a)zju.edu.cn>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
Signed-off-by: Xu Jia <xujia39(a)huawei.com>
---
drivers/net/hamradio/6pack.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/net/hamradio/6pack.c b/drivers/net/hamradio/6pack.c
index 02d6f3a..82507a6 100644
--- a/drivers/net/hamradio/6pack.c
+++ b/drivers/net/hamradio/6pack.c
@@ -679,9 +679,11 @@ static void sixpack_close(struct tty_struct *tty)
del_timer_sync(&sp->tx_t);
del_timer_sync(&sp->resync_t);
- /* Free all 6pack frame buffers. */
+ /* Free all 6pack frame buffers after unreg. */
kfree(sp->rbuff);
kfree(sp->xbuff);
+
+ free_netdev(sp->dev);
}
/* Perform I/O control on an active 6pack channel. */
--
1.8.3.1
I'm announcing the release of the 4.9.310 kernel.
All users of the 4.9 kernel series must upgrade.
The updated 4.9.y git tree can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git linux-4.9.y
and can be browsed at the normal kernel.org git web browser:
https://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=summary
thanks,
greg k-h
------------
Documentation/arm64/silicon-errata.txt | 1
Documentation/kernel-parameters.txt | 9
Makefile | 2
arch/arm/include/asm/kvm_host.h | 5
arch/arm/kvm/psci.c | 4
arch/arm64/Kconfig | 24 +
arch/arm64/include/asm/arch_timer.h | 44 +-
arch/arm64/include/asm/assembler.h | 34 +
arch/arm64/include/asm/cpu.h | 1
arch/arm64/include/asm/cpucaps.h | 4
arch/arm64/include/asm/cpufeature.h | 232 +++++++++++-
arch/arm64/include/asm/cputype.h | 63 +++
arch/arm64/include/asm/fixmap.h | 6
arch/arm64/include/asm/insn.h | 2
arch/arm64/include/asm/kvm_host.h | 4
arch/arm64/include/asm/kvm_mmu.h | 2
arch/arm64/include/asm/mmu.h | 8
arch/arm64/include/asm/processor.h | 6
arch/arm64/include/asm/sections.h | 6
arch/arm64/include/asm/sysreg.h | 5
arch/arm64/include/asm/vectors.h | 74 ++++
arch/arm64/kernel/bpi.S | 55 +++
arch/arm64/kernel/cpu_errata.c | 595 ++++++++++++++++++++++++++-------
arch/arm64/kernel/cpufeature.c | 167 ++++++---
arch/arm64/kernel/cpuinfo.c | 1
arch/arm64/kernel/entry.S | 197 ++++++++--
arch/arm64/kernel/fpsimd.c | 1
arch/arm64/kernel/insn.c | 29 +
arch/arm64/kernel/smp.c | 6
arch/arm64/kernel/traps.c | 4
arch/arm64/kernel/vmlinux.lds.S | 2
arch/arm64/kvm/hyp/hyp-entry.S | 4
arch/arm64/kvm/hyp/switch.c | 9
arch/arm64/mm/fault.c | 17
arch/arm64/mm/mmu.c | 11
drivers/clocksource/Kconfig | 4
drivers/clocksource/arm_arch_timer.c | 192 ++++++++--
include/linux/arm-smccc.h | 7
38 files changed, 1507 insertions(+), 330 deletions(-)
Anshuman Khandual (1):
arm64: Add Cortex-X2 CPU part definition
Arnd Bergmann (1):
arm64: arch_timer: avoid unused function warning
Dave Martin (1):
arm64: capabilities: Update prototype for enable call back
Ding Tianhong (2):
clocksource/drivers/arm_arch_timer: Remove fsl-a008585 parameter
clocksource/drivers/arm_arch_timer: Introduce generic errata handling infrastructure
Greg Kroah-Hartman (1):
Linux 4.9.310
James Morse (20):
arm64: Remove useless UAO IPI and describe how this gets enabled
arm64: entry.S: Add ventry overflow sanity checks
arm64: entry: Make the trampoline cleanup optional
arm64: entry: Free up another register on kpti's tramp_exit path
arm64: entry: Move the trampoline data page before the text page
arm64: entry: Allow tramp_alias to access symbols after the 4K boundary
arm64: entry: Don't assume tramp_vectors is the start of the vectors
arm64: entry: Move trampoline macros out of ifdef'd section
arm64: entry: Make the kpti trampoline's kpti sequence optional
arm64: entry: Allow the trampoline text to occupy multiple pages
arm64: entry: Add non-kpti __bp_harden_el1_vectors for mitigations
arm64: Move arm64_update_smccc_conduit() out of SSBD ifdef
arm64: entry: Add vectors that have the bhb mitigation sequences
arm64: entry: Add macro for reading symbol addresses from the trampoline
arm64: Add percpu vectors for EL1
KVM: arm64: Add templates for BHB mitigation sequences
arm64: Mitigate spectre style branch history side channels
KVM: arm64: Allow SMCCC_ARCH_WORKAROUND_3 to be discovered and migrated
arm64: add ID_AA64ISAR2_EL1 sys register
arm64: Use the clearbhb instruction in mitigations
Marc Zyngier (6):
arm64: arch_timer: Add infrastructure for multiple erratum detection methods
arm64: arch_timer: Add erratum handler for CPU-specific capability
arm64: arch_timer: Add workaround for ARM erratum 1188873
arm64: Add silicon-errata.txt entry for ARM erratum 1188873
arm64: Make ARM64_ERRATUM_1188873 depend on COMPAT
arm64: Add part number for Neoverse N1
Rob Herring (1):
arm64: Add part number for Arm Cortex-A77
Robert Richter (1):
arm64: errata: Provide macro for major and minor cpu revisions
Suzuki K Poulose (10):
arm64: Add MIDR encoding for Arm Cortex-A55 and Cortex-A35
arm64: capabilities: Move errata work around check on boot CPU
arm64: capabilities: Move errata processing code
arm64: capabilities: Prepare for fine grained capabilities
arm64: capabilities: Add flags to handle the conflicts on late CPU
arm64: capabilities: Clean up midr range helpers
arm64: Add helpers for checking CPU MIDR against a range
arm64: capabilities: Add support for checks based on a list of MIDRs
arm64: Add Neoverse-N2, Cortex-A710 CPU part definition
arm64: Add helper to decode register from instruction
This is the start of the stable review cycle for the 4.9.310 release.
There are 43 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Fri, 08 Apr 2022 18:24:27 +0000.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.9.310-rc…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-4.9.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 4.9.310-rc1
James Morse <james.morse(a)arm.com>
arm64: Use the clearbhb instruction in mitigations
James Morse <james.morse(a)arm.com>
arm64: add ID_AA64ISAR2_EL1 sys register
James Morse <james.morse(a)arm.com>
KVM: arm64: Allow SMCCC_ARCH_WORKAROUND_3 to be discovered and migrated
James Morse <james.morse(a)arm.com>
arm64: Mitigate spectre style branch history side channels
James Morse <james.morse(a)arm.com>
KVM: arm64: Add templates for BHB mitigation sequences
James Morse <james.morse(a)arm.com>
arm64: Add percpu vectors for EL1
James Morse <james.morse(a)arm.com>
arm64: entry: Add macro for reading symbol addresses from the trampoline
James Morse <james.morse(a)arm.com>
arm64: entry: Add vectors that have the bhb mitigation sequences
James Morse <james.morse(a)arm.com>
arm64: Move arm64_update_smccc_conduit() out of SSBD ifdef
James Morse <james.morse(a)arm.com>
arm64: entry: Add non-kpti __bp_harden_el1_vectors for mitigations
James Morse <james.morse(a)arm.com>
arm64: entry: Allow the trampoline text to occupy multiple pages
James Morse <james.morse(a)arm.com>
arm64: entry: Make the kpti trampoline's kpti sequence optional
James Morse <james.morse(a)arm.com>
arm64: entry: Move trampoline macros out of ifdef'd section
James Morse <james.morse(a)arm.com>
arm64: entry: Don't assume tramp_vectors is the start of the vectors
James Morse <james.morse(a)arm.com>
arm64: entry: Allow tramp_alias to access symbols after the 4K boundary
James Morse <james.morse(a)arm.com>
arm64: entry: Move the trampoline data page before the text page
James Morse <james.morse(a)arm.com>
arm64: entry: Free up another register on kpti's tramp_exit path
James Morse <james.morse(a)arm.com>
arm64: entry: Make the trampoline cleanup optional
James Morse <james.morse(a)arm.com>
arm64: entry.S: Add ventry overflow sanity checks
Suzuki K Poulose <suzuki.poulose(a)arm.com>
arm64: Add helper to decode register from instruction
Anshuman Khandual <anshuman.khandual(a)arm.com>
arm64: Add Cortex-X2 CPU part definition
Suzuki K Poulose <suzuki.poulose(a)arm.com>
arm64: Add Neoverse-N2, Cortex-A710 CPU part definition
Rob Herring <robh(a)kernel.org>
arm64: Add part number for Arm Cortex-A77
Marc Zyngier <marc.zyngier(a)arm.com>
arm64: Add part number for Neoverse N1
Marc Zyngier <marc.zyngier(a)arm.com>
arm64: Make ARM64_ERRATUM_1188873 depend on COMPAT
Marc Zyngier <marc.zyngier(a)arm.com>
arm64: Add silicon-errata.txt entry for ARM erratum 1188873
Arnd Bergmann <arnd(a)arndb.de>
arm64: arch_timer: avoid unused function warning
Marc Zyngier <marc.zyngier(a)arm.com>
arm64: arch_timer: Add workaround for ARM erratum 1188873
Marc Zyngier <marc.zyngier(a)arm.com>
arm64: arch_timer: Add erratum handler for CPU-specific capability
Marc Zyngier <marc.zyngier(a)arm.com>
arm64: arch_timer: Add infrastructure for multiple erratum detection methods
Ding Tianhong <dingtianhong(a)huawei.com>
clocksource/drivers/arm_arch_timer: Introduce generic errata handling infrastructure
Ding Tianhong <dingtianhong(a)huawei.com>
clocksource/drivers/arm_arch_timer: Remove fsl-a008585 parameter
Suzuki K Poulose <suzuki.poulose(a)arm.com>
arm64: capabilities: Add support for checks based on a list of MIDRs
Suzuki K Poulose <suzuki.poulose(a)arm.com>
arm64: Add helpers for checking CPU MIDR against a range
Suzuki K Poulose <suzuki.poulose(a)arm.com>
arm64: capabilities: Clean up midr range helpers
Suzuki K Poulose <suzuki.poulose(a)arm.com>
arm64: capabilities: Add flags to handle the conflicts on late CPU
Suzuki K Poulose <suzuki.poulose(a)arm.com>
arm64: capabilities: Prepare for fine grained capabilities
Suzuki K Poulose <suzuki.poulose(a)arm.com>
arm64: capabilities: Move errata processing code
Suzuki K Poulose <suzuki.poulose(a)arm.com>
arm64: capabilities: Move errata work around check on boot CPU
Dave Martin <dave.martin(a)arm.com>
arm64: capabilities: Update prototype for enable call back
Suzuki K Poulose <suzuki.poulose(a)arm.com>
arm64: Add MIDR encoding for Arm Cortex-A55 and Cortex-A35
James Morse <james.morse(a)arm.com>
arm64: Remove useless UAO IPI and describe how this gets enabled
Robert Richter <rrichter(a)cavium.com>
arm64: errata: Provide macro for major and minor cpu revisions
-------------
Diffstat:
Documentation/arm64/silicon-errata.txt | 1 +
Documentation/kernel-parameters.txt | 9 -
Makefile | 4 +-
arch/arm/include/asm/kvm_host.h | 5 +
arch/arm/kvm/psci.c | 4 +
arch/arm64/Kconfig | 24 ++
arch/arm64/include/asm/arch_timer.h | 44 ++-
arch/arm64/include/asm/assembler.h | 34 ++
arch/arm64/include/asm/cpu.h | 1 +
arch/arm64/include/asm/cpucaps.h | 4 +-
arch/arm64/include/asm/cpufeature.h | 232 ++++++++++++-
arch/arm64/include/asm/cputype.h | 63 ++++
arch/arm64/include/asm/fixmap.h | 6 +-
arch/arm64/include/asm/insn.h | 2 +
arch/arm64/include/asm/kvm_host.h | 4 +
arch/arm64/include/asm/kvm_mmu.h | 2 +-
arch/arm64/include/asm/mmu.h | 8 +-
arch/arm64/include/asm/processor.h | 6 +-
arch/arm64/include/asm/sections.h | 6 +
arch/arm64/include/asm/sysreg.h | 5 +
arch/arm64/include/asm/vectors.h | 74 +++++
arch/arm64/kernel/bpi.S | 55 +++
arch/arm64/kernel/cpu_errata.c | 591 ++++++++++++++++++++++++++-------
arch/arm64/kernel/cpufeature.c | 167 +++++++---
arch/arm64/kernel/cpuinfo.c | 1 +
arch/arm64/kernel/entry.S | 197 ++++++++---
arch/arm64/kernel/fpsimd.c | 1 +
arch/arm64/kernel/insn.c | 29 ++
arch/arm64/kernel/smp.c | 6 -
arch/arm64/kernel/traps.c | 4 +-
arch/arm64/kernel/vmlinux.lds.S | 2 +-
arch/arm64/kvm/hyp/hyp-entry.S | 4 +
arch/arm64/kvm/hyp/switch.c | 9 +-
arch/arm64/mm/fault.c | 17 +-
arch/arm64/mm/mmu.c | 11 +-
drivers/clocksource/Kconfig | 4 +
drivers/clocksource/arm_arch_timer.c | 192 ++++++++---
include/linux/arm-smccc.h | 7 +
38 files changed, 1504 insertions(+), 331 deletions(-)
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 0df6664531a12cdd8fc873f0cac0dcb40243d3e9 Mon Sep 17 00:00:00 2001
From: Marc Zyngier <maz(a)kernel.org>
Date: Tue, 15 Mar 2022 16:50:32 +0000
Subject: [PATCH] irqchip/gic-v3: Fix GICR_CTLR.RWP polling
It turns out that our polling of RWP is totally wrong when checking
for it in the redistributors, as we test the *distributor* bit index,
whereas it is a different bit number in the RDs... Oopsie boo.
This is embarrassing. Not only because it is wrong, but also because
it took *8 years* to notice the blunder...
Just fix the damn thing.
Fixes: 021f653791ad ("irqchip: gic-v3: Initial support for GICv3")
Signed-off-by: Marc Zyngier <maz(a)kernel.org>
Cc: stable(a)vger.kernel.org
Reviewed-by: Andre Przywara <andre.przywara(a)arm.com>
Reviewed-by: Lorenzo Pieralisi <lorenzo.pieralisi(a)arm.com>
Link: https://lore.kernel.org/r/20220315165034.794482-2-maz@kernel.org
diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
index 0efe1a9a9f3b..9b6316582515 100644
--- a/drivers/irqchip/irq-gic-v3.c
+++ b/drivers/irqchip/irq-gic-v3.c
@@ -206,11 +206,11 @@ static inline void __iomem *gic_dist_base(struct irq_data *d)
}
}
-static void gic_do_wait_for_rwp(void __iomem *base)
+static void gic_do_wait_for_rwp(void __iomem *base, u32 bit)
{
u32 count = 1000000; /* 1s! */
- while (readl_relaxed(base + GICD_CTLR) & GICD_CTLR_RWP) {
+ while (readl_relaxed(base + GICD_CTLR) & bit) {
count--;
if (!count) {
pr_err_ratelimited("RWP timeout, gone fishing\n");
@@ -224,13 +224,13 @@ static void gic_do_wait_for_rwp(void __iomem *base)
/* Wait for completion of a distributor change */
static void gic_dist_wait_for_rwp(void)
{
- gic_do_wait_for_rwp(gic_data.dist_base);
+ gic_do_wait_for_rwp(gic_data.dist_base, GICD_CTLR_RWP);
}
/* Wait for completion of a redistributor change */
static void gic_redist_wait_for_rwp(void)
{
- gic_do_wait_for_rwp(gic_data_rdist_rd_base());
+ gic_do_wait_for_rwp(gic_data_rdist_rd_base(), GICR_CTLR_RWP);
}
#ifdef CONFIG_ARM64
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 0df6664531a12cdd8fc873f0cac0dcb40243d3e9 Mon Sep 17 00:00:00 2001
From: Marc Zyngier <maz(a)kernel.org>
Date: Tue, 15 Mar 2022 16:50:32 +0000
Subject: [PATCH] irqchip/gic-v3: Fix GICR_CTLR.RWP polling
It turns out that our polling of RWP is totally wrong when checking
for it in the redistributors, as we test the *distributor* bit index,
whereas it is a different bit number in the RDs... Oopsie boo.
This is embarrassing. Not only because it is wrong, but also because
it took *8 years* to notice the blunder...
Just fix the damn thing.
Fixes: 021f653791ad ("irqchip: gic-v3: Initial support for GICv3")
Signed-off-by: Marc Zyngier <maz(a)kernel.org>
Cc: stable(a)vger.kernel.org
Reviewed-by: Andre Przywara <andre.przywara(a)arm.com>
Reviewed-by: Lorenzo Pieralisi <lorenzo.pieralisi(a)arm.com>
Link: https://lore.kernel.org/r/20220315165034.794482-2-maz@kernel.org
diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
index 0efe1a9a9f3b..9b6316582515 100644
--- a/drivers/irqchip/irq-gic-v3.c
+++ b/drivers/irqchip/irq-gic-v3.c
@@ -206,11 +206,11 @@ static inline void __iomem *gic_dist_base(struct irq_data *d)
}
}
-static void gic_do_wait_for_rwp(void __iomem *base)
+static void gic_do_wait_for_rwp(void __iomem *base, u32 bit)
{
u32 count = 1000000; /* 1s! */
- while (readl_relaxed(base + GICD_CTLR) & GICD_CTLR_RWP) {
+ while (readl_relaxed(base + GICD_CTLR) & bit) {
count--;
if (!count) {
pr_err_ratelimited("RWP timeout, gone fishing\n");
@@ -224,13 +224,13 @@ static void gic_do_wait_for_rwp(void __iomem *base)
/* Wait for completion of a distributor change */
static void gic_dist_wait_for_rwp(void)
{
- gic_do_wait_for_rwp(gic_data.dist_base);
+ gic_do_wait_for_rwp(gic_data.dist_base, GICD_CTLR_RWP);
}
/* Wait for completion of a redistributor change */
static void gic_redist_wait_for_rwp(void)
{
- gic_do_wait_for_rwp(gic_data_rdist_rd_base());
+ gic_do_wait_for_rwp(gic_data_rdist_rd_base(), GICR_CTLR_RWP);
}
#ifdef CONFIG_ARM64
From: Hongyu Xie <xiehongyu1(a)kylinos.cn>
pl2303.c doesn't have a reset_resume callback for hibernation,
so needs_binding will be set to 1 during hibernation.
usb_forced_unbind_intf will be called, and the port minor
will be released (the x in ttyUSBx).
It works fine if you have only one USB-to-serial device.
Assume you have two USB-to-serial devices, named A and B.
A gets the smaller minor (ttyUSB0), B gets a bigger one.
Now start to hibernate. While your PC is in hibernation,
unplug device A, then wake up your PC by pressing the
power button. After the whole system wakes up, device
B gets ttyUSB0. This causes a problem if you were
using those two ports (e.g. had two minicom processes open)
before hibernation.
So the reset_resume member is needed in the usb_serial_driver
pl2303_device.
The code in pl2303_reset_resume is borrowed from pl2303_open.
In fact, every driver under drivers/usb/serial
has the same problem except ch341.c.
Cc: stable(a)vger.kernel.org
Signed-off-by: Hongyu Xie <xiehongyu1(a)kylinos.cn>
Reported-by: sheng.huang <sheng.huang(a)ecastech.com>
---
drivers/usb/serial/pl2303.c | 48 +++++++++++++++++++++++++++++++++++++
1 file changed, 48 insertions(+)
diff --git a/drivers/usb/serial/pl2303.c b/drivers/usb/serial/pl2303.c
index 88b284d61681..7cc05123b88c 100644
--- a/drivers/usb/serial/pl2303.c
+++ b/drivers/usb/serial/pl2303.c
@@ -1218,6 +1218,53 @@ static void pl2303_process_read_urb(struct urb *urb)
tty_flip_buffer_push(&port->port);
}
+static int pl2303_configure(struct usb_serial *serial, struct pl2303_serial_private *priv)
+{
+ struct usb_serial_port *port = serial->port[0];
+
+ if (priv->quirks & PL2303_QUIRK_LEGACY) {
+ usb_clear_halt(serial->dev, port->write_urb->pipe);
+ usb_clear_halt(serial->dev, port->read_urb->pipe);
+ } else {
+ /* reset upstream data pipes */
+ if (priv->type == &pl2303_type_data[TYPE_HXN])
+ pl2303_vendor_write(serial, PL2303_HXN_RESET_REG,
+ PL2303_HXN_RESET_UPSTREAM_PIPE |
+ PL2303_HXN_RESET_DOWNSTREAM_PIPE);
+ else {
+ pl2303_vendor_write(serial, 8, 0);
+ pl2303_vendor_write(serial, 9, 0);
+ }
+ }
+ return 0;
+}
+
+static int pl2303_reset_resume(struct usb_serial *serial)
+{
+ struct usb_serial_port *port = serial->port[0];
+ struct pl2303_serial_private *priv = usb_get_serial_port_data(port);
+ struct tty_struct *tty = tty_port_tty_get(&port->port);
+ int ret;
+
+ /* reconfigure pl2303 serial port after bus-reset */
+ pl2303_configure(serial, priv);
+
+ /* Setup termios */
+ if (tty)
+ pl2303_set_termios(tty, port, NULL);
+
+ if (tty_port_initialized(&port->port)) {
+ ret = usb_submit_urb(port->interrupt_in_urb, GFP_NOIO);
+ if (ret) {
+ dev_err(&port->dev, "failed to submit interrupt urb: %d\n",
+ ret);
+ return ret;
+ }
+ }
+
+ return usb_serial_generic_resume(serial);
+}
+
static struct usb_serial_driver pl2303_device = {
.driver = {
.owner = THIS_MODULE,
@@ -1246,6 +1293,7 @@ static struct usb_serial_driver pl2303_device = {
.release = pl2303_release,
.port_probe = pl2303_port_probe,
.port_remove = pl2303_port_remove,
+ .reset_resume = pl2303_reset_resume,
};
static struct usb_serial_driver * const serial_drivers[] = {
--
2.25.1
This set fixes a process hang issue in the btrfs and gfs2 filesystems. When we
do a direct IO read or write and the buffer given by the user is
memory-mapped to the file range we are going to do IO on, we end up
in a deadlock. This is triggered by the test case generic/647 from
fstests.
This fix depends on the iov_iter and iomap changes introduced in the
commit c03098d4b9ad ("Merge tag 'gfs2-v5.15-rc5-mmap-fault' of
git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2") and they
are part of this set for stable-5.15.y.
Please note that patch 3/17 in this patchset changes the prototype and
renames an exported symbol as below. All its references are updated as
well.
-EXPORT_SYMBOL(iov_iter_fault_in_readable);
+EXPORT_SYMBOL(fault_in_iov_iter_readable);
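As a caller-side illustration of that rename (a sketch only; it assumes the
new helper returns the number of bytes that could not be faulted in, rather
than the old 0/-EFAULT convention):

	/* Before this series (hypothetical caller): */
	if (iov_iter_fault_in_readable(i, bytes))
		return -EFAULT;

	/* After this series: */
	if (fault_in_iov_iter_readable(i, bytes) == bytes)
		return -EFAULT;	/* nothing could be faulted in */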
Andreas Gruenbacher (15):
powerpc/kvm: Fix kvm_use_magic_page
gup: Turn fault_in_pages_{readable,writeable} into
fault_in_{readable,writeable}
iov_iter: Turn iov_iter_fault_in_readable into
fault_in_iov_iter_readable
iov_iter: Introduce fault_in_iov_iter_writeable
gfs2: Add wrapper for iomap_file_buffered_write
gfs2: Clean up function may_grant
gfs2: Move the inode glock locking to gfs2_file_buffered_write
gfs2: Eliminate ip->i_gh
gfs2: Fix mmap + page fault deadlocks for buffered I/O
iomap: Fix iomap_dio_rw return value for user copies
iomap: Support partial direct I/O on user copy failures
iomap: Add done_before argument to iomap_dio_rw
gup: Introduce FOLL_NOFAULT flag to disable page faults
iov_iter: Introduce nofault flag to disable page faults
gfs2: Fix mmap + page fault deadlocks for direct I/O
Bob Peterson (1):
gfs2: Introduce flag for glock holder auto-demotion
Filipe Manana (1):
btrfs: fix deadlock due to page faults during direct IO reads and
writes
arch/powerpc/kernel/kvm.c | 3 +-
arch/powerpc/kernel/signal_32.c | 4 +-
arch/powerpc/kernel/signal_64.c | 2 +-
arch/x86/kernel/fpu/signal.c | 7 +-
drivers/gpu/drm/armada/armada_gem.c | 7 +-
fs/btrfs/file.c | 142 ++++++++++--
fs/btrfs/ioctl.c | 5 +-
fs/erofs/data.c | 2 +-
fs/ext4/file.c | 5 +-
fs/f2fs/file.c | 2 +-
fs/fuse/file.c | 2 +-
fs/gfs2/bmap.c | 60 +----
fs/gfs2/file.c | 252 +++++++++++++++++++--
fs/gfs2/glock.c | 330 +++++++++++++++++++++-------
fs/gfs2/glock.h | 20 ++
fs/gfs2/incore.h | 4 +-
fs/iomap/buffered-io.c | 2 +-
fs/iomap/direct-io.c | 29 ++-
fs/ntfs/file.c | 2 +-
fs/ntfs3/file.c | 2 +-
fs/xfs/xfs_file.c | 6 +-
fs/zonefs/super.c | 4 +-
include/linux/iomap.h | 11 +-
include/linux/mm.h | 3 +-
include/linux/pagemap.h | 58 +----
include/linux/uio.h | 4 +-
lib/iov_iter.c | 98 +++++++--
mm/filemap.c | 4 +-
mm/gup.c | 139 +++++++++++-
29 files changed, 911 insertions(+), 298 deletions(-)
--
2.33.1
+CC stable
On 03/01/22 15:24, tip-bot2 for Valentin Schneider wrote:
> The following commit has been merged into the sched/core branch of tip:
>
> Commit-ID: fa2c3254d7cfff5f7a916ab928a562d1165f17bb
> Gitweb: https://git.kernel.org/tip/fa2c3254d7cfff5f7a916ab928a562d1165f17bb
> Author: Valentin Schneider <valentin.schneider(a)arm.com>
> AuthorDate: Thu, 20 Jan 2022 16:25:19
> Committer: Peter Zijlstra <peterz(a)infradead.org>
> CommitterDate: Tue, 01 Mar 2022 16:18:39 +01:00
>
> sched/tracing: Don't re-read p->state when emitting sched_switch event
>
> As of commit
>
> c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")
>
> the following sequence becomes possible:
>
> p->__state = TASK_INTERRUPTIBLE;
> __schedule()
> deactivate_task(p);
> ttwu()
> READ !p->on_rq
> p->__state=TASK_WAKING
> trace_sched_switch()
> __trace_sched_switch_state()
> task_state_index()
> return 0;
>
> TASK_WAKING isn't in TASK_REPORT, so the task appears as TASK_RUNNING in
> the trace event.
>
> Prevent this by pushing the value read from __schedule() down the trace
> event.
>
> Reported-by: Abhijeet Dharmapurikar <adharmap(a)quicinc.com>
> Signed-off-by: Valentin Schneider <valentin.schneider(a)arm.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
> Reviewed-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
> Link: https://lore.kernel.org/r/20220120162520.570782-2-valentin.schneider@arm.com
Any objection to picking this for stable? I'm interested in this one for some
Android users, but would prefer it be taken by stable rather than backporting it
individually.
I think it makes sense to pick the next one in the series too.
Thanks!
--
Qais Yousef
+CC stable
On 01/20/22 16:25, Valentin Schneider wrote:
> TASK_RTLOCK_WAIT currently isn't part of TASK_REPORT, thus a task blocking
> on an rtlock will appear as having a task state == 0, IOW TASK_RUNNING.
>
> The actual state is saved in p->saved_state, but reading it after reading
> p->__state has a few issues:
> o that could still be TASK_RUNNING in the case of e.g. rt_spin_lock
> o ttwu_state_match() might have changed that to TASK_RUNNING
>
> As pointed out by Eric, adding TASK_RTLOCK_WAIT to TASK_REPORT implies
> exposing a new state to userspace tools which may not know what to do with
> them. The only information that needs to be conveyed here is that a task is
> waiting on an rt_mutex, which matches TASK_UNINTERRUPTIBLE - there's no
> need for a new state.
>
> Reported-by: Uwe Kleine-König <u.kleine-koenig(a)pengutronix.de>
> Signed-off-by: Valentin Schneider <valentin.schneider(a)arm.com>
Any objection to this being picked up by stable? For stable we only care about
patch 1 in this series, but it seems sensible to pick this one too; no strong
feelings if it is omitted.
AFAICT the problem dates back to commit:
1593baab910d ("sched/debug: Implement consistent task-state printing")
or even before. I think v4.14+ is good enough.
Thoughts?
Thanks!
--
Qais Yousef
Hi everyone,
These two patches resolve two instances of -Wstrict-prototypes with
newer versions of clang that are present in 5.4. The main Makefile makes
this a hard error.
The first patch is upstream commit 63617d8b125e ("drm/amdkfd: add
missing void argument to function kgd2kfd_init"), which showed up in
5.5.
The second patch has no upstream equivalent, as the code in question was
removed in commit e392c887df97 ("drm/amdkfd: Use array to probe
kfd2kgd_calls") upstream, which is part of a larger series that did not
look reasonable for stable. I opted to just fix the warning in the same
manner as the prior patch, which is less risky and accomplishes the same
end result of no warning.
Colin Ian King (1):
drm/amdkfd: add missing void argument to function kgd2kfd_init
Nathan Chancellor (1):
drm/amdkfd: Fix -Wstrict-prototypes from
amdgpu_amdkfd_gfx_10_0_get_functions()
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v10.c | 2 +-
drivers/gpu/drm/amd/amdkfd/kfd_module.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
base-commit: 2845ff3fd34499603249676495c524a35e795b45
--
2.35.1
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 5b6547ed97f4f5dfc23f8e3970af6d11d7b7ed7e Mon Sep 17 00:00:00 2001
From: Peter Zijlstra <peterz(a)infradead.org>
Date: Wed, 16 Mar 2022 22:03:41 +0100
Subject: [PATCH] sched/core: Fix forceidle balancing
Steve reported that ChromeOS encounters the forceidle balancer being
run from rt_mutex_setprio()'s balance_callback() invocation and
explodes.
Now, the forceidle balancer gets queued every time the idle task gets
selected, set_next_task(), which is strictly too often.
rt_mutex_setprio() also uses set_next_task() in the 'change' pattern:
	queued = task_on_rq_queued(p); /* p->on_rq == TASK_ON_RQ_QUEUED */
	running = task_current(rq, p); /* rq->curr == p */
	if (queued)
		dequeue_task(...);
	if (running)
		put_prev_task(...);
	/* change task properties */
	if (queued)
		enqueue_task(...);
	if (running)
		set_next_task(...);
However, rt_mutex_setprio() will explicitly not run this pattern on
the idle task (since priority boosting the idle task is quite insane).
Most other 'change' pattern users are pidhash based and would also not
apply to idle.
Also, the change pattern doesn't contain a __balance_callback()
invocation and hence we could have an out-of-band balance-callback,
which *should* trigger the WARN in rq_pin_lock() (which guards against
this exact anti-pattern).
So while none of that explains how this happens, it does indicate that
having it in set_next_task() might not be the most robust option.
Instead, explicitly queue the forceidle balancer from pick_next_task()
when it does indeed result in forceidle selection. Having it here,
ensures it can only be triggered under the __schedule() rq->lock
instance, and hence must be run from that context.
This also happens to clean up the code a little, so win-win.
Fixes: d2dfa17bc7de ("sched: Trivial forced-newidle balancer")
Reported-by: Steven Rostedt <rostedt(a)goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Tested-by: T.J. Alumbaugh <talumbau(a)chromium.org>
Link: https://lkml.kernel.org/r/20220330160535.GN8939@worktop.programming.kicks-a…
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d575b4914925..017ee7807930 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5752,6 +5752,8 @@ static inline struct task_struct *pick_task(struct rq *rq)
extern void task_vruntime_update(struct rq *rq, struct task_struct *p, bool in_fi);
+static void queue_core_balance(struct rq *rq);
+
static struct task_struct *
pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
@@ -5801,7 +5803,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
}
rq->core_pick = NULL;
- return next;
+ goto out;
}
put_prev_task_balance(rq, prev, rf);
@@ -5851,7 +5853,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
*/
WARN_ON_ONCE(fi_before);
task_vruntime_update(rq, next, false);
- goto done;
+ goto out_set_next;
}
}
@@ -5970,8 +5972,12 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
resched_curr(rq_i);
}
-done:
+out_set_next:
set_next_task(rq, next);
+out:
+ if (rq->core->core_forceidle_count && next == rq->idle)
+ queue_core_balance(rq);
+
return next;
}
@@ -6066,7 +6072,7 @@ static void sched_core_balance(struct rq *rq)
static DEFINE_PER_CPU(struct callback_head, core_balance_head);
-void queue_core_balance(struct rq *rq)
+static void queue_core_balance(struct rq *rq)
{
if (!sched_core_enabled(rq))
return;
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 8f8b5020e76a..ecb0d7052877 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -434,7 +434,6 @@ static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool fir
{
update_idle_core(rq);
schedstat_inc(rq->sched_goidle);
- queue_core_balance(rq);
}
#ifdef CONFIG_SMP
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 58263f90c559..8dccb34eb190 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1232,8 +1232,6 @@ static inline bool sched_group_cookie_match(struct rq *rq,
return false;
}
-extern void queue_core_balance(struct rq *rq);
-
static inline bool sched_core_enqueued(struct task_struct *p)
{
return !RB_EMPTY_NODE(&p->core_node);
@@ -1267,10 +1265,6 @@ static inline raw_spinlock_t *__rq_lockp(struct rq *rq)
return &rq->__lock;
}
-static inline void queue_core_balance(struct rq *rq)
-{
-}
-
static inline bool sched_cpu_cookie_match(struct rq *rq, struct task_struct *p)
{
return true;
The patch below does not apply to the 5.16-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 5b6547ed97f4f5dfc23f8e3970af6d11d7b7ed7e Mon Sep 17 00:00:00 2001
From: Peter Zijlstra <peterz(a)infradead.org>
Date: Wed, 16 Mar 2022 22:03:41 +0100
Subject: [PATCH] sched/core: Fix forceidle balancing
Steve reported that ChromeOS encounters the forceidle balancer being
run from rt_mutex_setprio()'s balance_callback() invocation and
explodes.
Now, the forceidle balancer gets queued every time the idle task gets
selected, set_next_task(), which is strictly too often.
rt_mutex_setprio() also uses set_next_task() in the 'change' pattern:
	queued = task_on_rq_queued(p); /* p->on_rq == TASK_ON_RQ_QUEUED */
	running = task_current(rq, p); /* rq->curr == p */
	if (queued)
		dequeue_task(...);
	if (running)
		put_prev_task(...);
	/* change task properties */
	if (queued)
		enqueue_task(...);
	if (running)
		set_next_task(...);
However, rt_mutex_setprio() will explicitly not run this pattern on
the idle task (since priority boosting the idle task is quite insane).
Most other 'change' pattern users are pidhash based and would also not
apply to idle.
Also, the change pattern doesn't contain a __balance_callback()
invocation and hence we could have an out-of-band balance-callback,
which *should* trigger the WARN in rq_pin_lock() (which guards against
this exact anti-pattern).
So while none of that explains how this happens, it does indicate that
having it in set_next_task() might not be the most robust option.
Instead, explicitly queue the forceidle balancer from pick_next_task()
when it does indeed result in forceidle selection. Having it here,
ensures it can only be triggered under the __schedule() rq->lock
instance, and hence must be run from that context.
This also happens to clean up the code a little, so win-win.
Fixes: d2dfa17bc7de ("sched: Trivial forced-newidle balancer")
Reported-by: Steven Rostedt <rostedt(a)goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Tested-by: T.J. Alumbaugh <talumbau(a)chromium.org>
Link: https://lkml.kernel.org/r/20220330160535.GN8939@worktop.programming.kicks-a…
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d575b4914925..017ee7807930 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5752,6 +5752,8 @@ static inline struct task_struct *pick_task(struct rq *rq)
extern void task_vruntime_update(struct rq *rq, struct task_struct *p, bool in_fi);
+static void queue_core_balance(struct rq *rq);
+
static struct task_struct *
pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
@@ -5801,7 +5803,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
}
rq->core_pick = NULL;
- return next;
+ goto out;
}
put_prev_task_balance(rq, prev, rf);
@@ -5851,7 +5853,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
*/
WARN_ON_ONCE(fi_before);
task_vruntime_update(rq, next, false);
- goto done;
+ goto out_set_next;
}
}
@@ -5970,8 +5972,12 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
resched_curr(rq_i);
}
-done:
+out_set_next:
set_next_task(rq, next);
+out:
+ if (rq->core->core_forceidle_count && next == rq->idle)
+ queue_core_balance(rq);
+
return next;
}
@@ -6066,7 +6072,7 @@ static void sched_core_balance(struct rq *rq)
static DEFINE_PER_CPU(struct callback_head, core_balance_head);
-void queue_core_balance(struct rq *rq)
+static void queue_core_balance(struct rq *rq)
{
if (!sched_core_enabled(rq))
return;
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 8f8b5020e76a..ecb0d7052877 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -434,7 +434,6 @@ static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool fir
{
update_idle_core(rq);
schedstat_inc(rq->sched_goidle);
- queue_core_balance(rq);
}
#ifdef CONFIG_SMP
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 58263f90c559..8dccb34eb190 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1232,8 +1232,6 @@ static inline bool sched_group_cookie_match(struct rq *rq,
return false;
}
-extern void queue_core_balance(struct rq *rq);
-
static inline bool sched_core_enqueued(struct task_struct *p)
{
return !RB_EMPTY_NODE(&p->core_node);
@@ -1267,10 +1265,6 @@ static inline raw_spinlock_t *__rq_lockp(struct rq *rq)
return &rq->__lock;
}
-static inline void queue_core_balance(struct rq *rq)
-{
-}
-
static inline bool sched_cpu_cookie_match(struct rq *rq, struct task_struct *p)
{
return true;
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 1273f89578f268ea705ddbad60c4bd2dcff80611 Mon Sep 17 00:00:00 2001
From: Ivan Vecera <ivecera(a)redhat.com>
Date: Thu, 31 Mar 2022 09:20:08 -0700
Subject: [PATCH] ice: Fix broken IFF_ALLMULTI handling
Handling of the all-multicast flag and the associated multicast promiscuous
mode is broken in the ice driver. When a user switches the allmulticast
flag on or off, the driver checks whether any VLANs are configured
over the interface (except the default VLAN 0).
If any extra VLANs are registered it enables multicast promiscuous
mode for all these VLANs (including default VLAN 0) using
ICE_SW_LKUP_PROMISC_VLAN look-up type. In this situation all
multicast packets tagged with known VLAN ID or untagged are received
and multicast packets tagged with unknown VLAN ID ignored.
If no extra VLANs are registered (so only VLAN 0 exists) it enables
multicast promiscuous mode for VLAN 0 and uses ICE_SW_LKUP_PROMISC
look-up type. In this situation any multicast packets including
tagged ones are received.
The driver handles IFF_ALLMULTI in ice_vsi_sync_fltr() this way:
ice_vsi_sync_fltr() {
	...
	if (changed_flags & IFF_ALLMULTI) {
		if (netdev->flags & IFF_ALLMULTI) {
			if (vsi->num_vlans > 1)
				ice_set_promisc(..., ICE_MCAST_VLAN_PROMISC_BITS);
			else
				ice_set_promisc(..., ICE_MCAST_PROMISC_BITS);
		} else {
			if (vsi->num_vlans > 1)
				ice_clear_promisc(..., ICE_MCAST_VLAN_PROMISC_BITS);
			else
				ice_clear_promisc(..., ICE_MCAST_PROMISC_BITS);
		}
	}
	...
}
The code above depends on the value vsi->num_vlan, which specifies the
number of VLANs configured over the interface (including VLAN 0), and
this is a problem because that value is modified in the NDO callbacks
ice_vlan_rx_add_vid() and ice_vlan_rx_kill_vid().
Scenario 1:
1. ip link set ens7f0 allmulticast on
2. ip link add vlan10 link ens7f0 type vlan id 10
3. ip link set ens7f0 allmulticast off
4. ip link set ens7f0 allmulticast on
[1] In this scenario IFF_ALLMULTI is enabled and the driver calls
ice_set_promisc(..., ICE_MCAST_PROMISC_BITS) that installs
multicast promisc rule with non-VLAN look-up type.
[2] Then VLAN with ID 10 is added and vsi->num_vlan incremented to 2
[3] Command switches IFF_ALLMULTI off and the driver calls
ice_clear_promisc(..., ICE_MCAST_VLAN_PROMISC_BITS) but this
call is effectively NOP because it looks for multicast promisc
rules for VLAN 0 and VLAN 10 with VLAN look-up type but no such
rules exist. So the all-multicast remains enabled silently
in hardware.
[4] Command tries to switch IFF_ALLMULTI on and the driver calls
ice_set_promisc(..., ICE_MCAST_PROMISC_BITS) but this call
fails (-EEXIST) because non-VLAN multicast promisc rule already
exists.
Scenario 2:
1. ip link add vlan10 link ens7f0 type vlan id 10
2. ip link set ens7f0 allmulticast on
3. ip link add vlan20 link ens7f0 type vlan id 20
4. ip link del vlan10 ; ip link del vlan20
5. ip link set ens7f0 allmulticast off
[1] VLAN with ID 10 is added and vsi->num_vlan==2
[2] Command switches IFF_ALLMULTI on and driver installs multicast
promisc rules with VLAN look-up type for VLAN 0 and 10
[3] VLAN with ID 20 is added and vsi->num_vlan==3 but no multicast
promisc rules is added for this new VLAN so the interface does
not receive MC packets from VLAN 20
[4] Both VLANs are removed but multicast rule for VLAN 10 remains
installed so interface receives multicast packets from VLAN 10
[5] Command switches IFF_ALLMULTI off and because vsi->num_vlan is 1
the driver tries to remove multicast promisc rule for VLAN 0
with non-VLAN look-up that does not exist.
All-multicast looks disabled from user point of view but it
is partially enabled in HW (interface receives all multicast
packets either untagged or tagged with VLAN ID 10)
To resolve these issues the patch introduces these changes:
1. Adds handling for IFF_ALLMULTI to the ice_vlan_rx_add_vid() and
ice_vlan_rx_kill_vid() callbacks, so when a VLAN is added or removed
while IFF_ALLMULTI is enabled, an appropriate multicast promisc
rule for that VLAN ID is added or removed.
2. In ice_vlan_rx_add_vid(), when the first VLAN besides VLAN 0 is added
(so vsi->num_vlan == 2) and IFF_ALLMULTI is enabled, the look-up
type of the existing multicast promisc rule for VLAN 0 is updated
to ICE_MCAST_VLAN_PROMISC_BITS.
3. In ice_vlan_rx_kill_vid(), when the last VLAN besides VLAN 0 is removed
(so vsi->num_vlan == 1) and IFF_ALLMULTI is enabled, the look-up
type of the existing multicast promisc rule for VLAN 0 is updated
to ICE_MCAST_PROMISC_BITS.
4. Both ice_vlan_rx_{add,kill}_vid() have to run under ICE_CFG_BUSY
bit protection to avoid races with ice_vsi_sync_fltr(), which runs
in ice_service_task() context (see the sketch after this list).
5. The ICE_VSI_VLAN_FLTR_CHANGED bit is now useless and can be removed.
6. Error messages are added to the ice_fltr_*_vsi_promisc() helper
functions so they do not need to be duplicated in their callers.
7. Small improvements to increase readability.
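As a reference for change 4, here is a minimal sketch of the ICE_CFG_BUSY
serialization pattern, mirroring the ice_vlan_rx_add_vid() hunk in the diff
below (heavily simplified: error handling and the VLAN 0 look-up type update
are left out, and the function name is hypothetical):

/* Hypothetical, simplified illustration only; the real change is in the
 * ice_main.c hunks of the diff below.
 */
static int ice_vlan_add_vid_sketch(struct ice_vsi *vsi, u16 vid)
{
    int ret = 0;

    /* Serialize against ice_vsi_sync_fltr() running in ice_service_task() */
    while (test_and_set_bit(ICE_CFG_BUSY, vsi->state))
        usleep_range(1000, 2000);

    /* Mirror the new VLAN with a multicast promisc rule while allmulti is on */
    if (vsi->current_netdev_flags & IFF_ALLMULTI)
        ret = ice_fltr_set_vsi_promisc(&vsi->back->hw, vsi->idx,
                                       ICE_MCAST_VLAN_PROMISC_BITS, vid);

    clear_bit(ICE_CFG_BUSY, vsi->state);
    return ret;
}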
Fixes: 5eda8afd6bcc ("ice: Add support for PF/VF promiscuous mode")
Signed-off-by: Ivan Vecera <ivecera(a)redhat.com>
Reviewed-by: Jacob Keller <jacob.e.keller(a)intel.com>
Signed-off-by: Alice Michael <alice.michael(a)intel.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
diff --git a/drivers/net/ethernet/intel/ice/ice.h b/drivers/net/ethernet/intel/ice/ice.h
index d4f1874df7d0..26eaee0b6503 100644
--- a/drivers/net/ethernet/intel/ice/ice.h
+++ b/drivers/net/ethernet/intel/ice/ice.h
@@ -301,7 +301,6 @@ enum ice_vsi_state {
ICE_VSI_NETDEV_REGISTERED,
ICE_VSI_UMAC_FLTR_CHANGED,
ICE_VSI_MMAC_FLTR_CHANGED,
- ICE_VSI_VLAN_FLTR_CHANGED,
ICE_VSI_PROMISC_CHANGED,
ICE_VSI_STATE_NBITS /* must be last */
};
diff --git a/drivers/net/ethernet/intel/ice/ice_fltr.c b/drivers/net/ethernet/intel/ice/ice_fltr.c
index af57eb114966..85a94483c2ed 100644
--- a/drivers/net/ethernet/intel/ice/ice_fltr.c
+++ b/drivers/net/ethernet/intel/ice/ice_fltr.c
@@ -58,7 +58,16 @@ int
ice_fltr_set_vlan_vsi_promisc(struct ice_hw *hw, struct ice_vsi *vsi,
u8 promisc_mask)
{
- return ice_set_vlan_vsi_promisc(hw, vsi->idx, promisc_mask, false);
+ struct ice_pf *pf = hw->back;
+ int result;
+
+ result = ice_set_vlan_vsi_promisc(hw, vsi->idx, promisc_mask, false);
+ if (result)
+ dev_err(ice_pf_to_dev(pf),
+ "Error setting promisc mode on VSI %i (rc=%d)\n",
+ vsi->vsi_num, result);
+
+ return result;
}
/**
@@ -73,7 +82,16 @@ int
ice_fltr_clear_vlan_vsi_promisc(struct ice_hw *hw, struct ice_vsi *vsi,
u8 promisc_mask)
{
- return ice_set_vlan_vsi_promisc(hw, vsi->idx, promisc_mask, true);
+ struct ice_pf *pf = hw->back;
+ int result;
+
+ result = ice_set_vlan_vsi_promisc(hw, vsi->idx, promisc_mask, true);
+ if (result)
+ dev_err(ice_pf_to_dev(pf),
+ "Error clearing promisc mode on VSI %i (rc=%d)\n",
+ vsi->vsi_num, result);
+
+ return result;
}
/**
@@ -87,7 +105,16 @@ int
ice_fltr_clear_vsi_promisc(struct ice_hw *hw, u16 vsi_handle, u8 promisc_mask,
u16 vid)
{
- return ice_clear_vsi_promisc(hw, vsi_handle, promisc_mask, vid);
+ struct ice_pf *pf = hw->back;
+ int result;
+
+ result = ice_clear_vsi_promisc(hw, vsi_handle, promisc_mask, vid);
+ if (result)
+ dev_err(ice_pf_to_dev(pf),
+ "Error clearing promisc mode on VSI %i for VID %u (rc=%d)\n",
+ ice_get_hw_vsi_num(hw, vsi_handle), vid, result);
+
+ return result;
}
/**
@@ -101,7 +128,16 @@ int
ice_fltr_set_vsi_promisc(struct ice_hw *hw, u16 vsi_handle, u8 promisc_mask,
u16 vid)
{
- return ice_set_vsi_promisc(hw, vsi_handle, promisc_mask, vid);
+ struct ice_pf *pf = hw->back;
+ int result;
+
+ result = ice_set_vsi_promisc(hw, vsi_handle, promisc_mask, vid);
+ if (result)
+ dev_err(ice_pf_to_dev(pf),
+ "Error setting promisc mode on VSI %i for VID %u (rc=%d)\n",
+ ice_get_hw_vsi_num(hw, vsi_handle), vid, result);
+
+ return result;
}
/**
diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
index d755ce07869f..1d2ca39add95 100644
--- a/drivers/net/ethernet/intel/ice/ice_main.c
+++ b/drivers/net/ethernet/intel/ice/ice_main.c
@@ -243,8 +243,7 @@ static int ice_add_mac_to_unsync_list(struct net_device *netdev, const u8 *addr)
static bool ice_vsi_fltr_changed(struct ice_vsi *vsi)
{
return test_bit(ICE_VSI_UMAC_FLTR_CHANGED, vsi->state) ||
- test_bit(ICE_VSI_MMAC_FLTR_CHANGED, vsi->state) ||
- test_bit(ICE_VSI_VLAN_FLTR_CHANGED, vsi->state);
+ test_bit(ICE_VSI_MMAC_FLTR_CHANGED, vsi->state);
}
/**
@@ -260,10 +259,15 @@ static int ice_set_promisc(struct ice_vsi *vsi, u8 promisc_m)
if (vsi->type != ICE_VSI_PF)
return 0;
- if (ice_vsi_has_non_zero_vlans(vsi))
- status = ice_fltr_set_vlan_vsi_promisc(&vsi->back->hw, vsi, promisc_m);
- else
- status = ice_fltr_set_vsi_promisc(&vsi->back->hw, vsi->idx, promisc_m, 0);
+ if (ice_vsi_has_non_zero_vlans(vsi)) {
+ promisc_m |= (ICE_PROMISC_VLAN_RX | ICE_PROMISC_VLAN_TX);
+ status = ice_fltr_set_vlan_vsi_promisc(&vsi->back->hw, vsi,
+ promisc_m);
+ } else {
+ status = ice_fltr_set_vsi_promisc(&vsi->back->hw, vsi->idx,
+ promisc_m, 0);
+ }
+
return status;
}
@@ -280,10 +284,15 @@ static int ice_clear_promisc(struct ice_vsi *vsi, u8 promisc_m)
if (vsi->type != ICE_VSI_PF)
return 0;
- if (ice_vsi_has_non_zero_vlans(vsi))
- status = ice_fltr_clear_vlan_vsi_promisc(&vsi->back->hw, vsi, promisc_m);
- else
- status = ice_fltr_clear_vsi_promisc(&vsi->back->hw, vsi->idx, promisc_m, 0);
+ if (ice_vsi_has_non_zero_vlans(vsi)) {
+ promisc_m |= (ICE_PROMISC_VLAN_RX | ICE_PROMISC_VLAN_TX);
+ status = ice_fltr_clear_vlan_vsi_promisc(&vsi->back->hw, vsi,
+ promisc_m);
+ } else {
+ status = ice_fltr_clear_vsi_promisc(&vsi->back->hw, vsi->idx,
+ promisc_m, 0);
+ }
+
return status;
}
@@ -302,7 +311,6 @@ static int ice_vsi_sync_fltr(struct ice_vsi *vsi)
struct ice_pf *pf = vsi->back;
struct ice_hw *hw = &pf->hw;
u32 changed_flags = 0;
- u8 promisc_m;
int err;
if (!vsi->netdev)
@@ -320,7 +328,6 @@ static int ice_vsi_sync_fltr(struct ice_vsi *vsi)
if (ice_vsi_fltr_changed(vsi)) {
clear_bit(ICE_VSI_UMAC_FLTR_CHANGED, vsi->state);
clear_bit(ICE_VSI_MMAC_FLTR_CHANGED, vsi->state);
- clear_bit(ICE_VSI_VLAN_FLTR_CHANGED, vsi->state);
/* grab the netdev's addr_list_lock */
netif_addr_lock_bh(netdev);
@@ -369,29 +376,15 @@ static int ice_vsi_sync_fltr(struct ice_vsi *vsi)
/* check for changes in promiscuous modes */
if (changed_flags & IFF_ALLMULTI) {
if (vsi->current_netdev_flags & IFF_ALLMULTI) {
- if (ice_vsi_has_non_zero_vlans(vsi))
- promisc_m = ICE_MCAST_VLAN_PROMISC_BITS;
- else
- promisc_m = ICE_MCAST_PROMISC_BITS;
-
- err = ice_set_promisc(vsi, promisc_m);
+ err = ice_set_promisc(vsi, ICE_MCAST_PROMISC_BITS);
if (err) {
- netdev_err(netdev, "Error setting Multicast promiscuous mode on VSI %i\n",
- vsi->vsi_num);
vsi->current_netdev_flags &= ~IFF_ALLMULTI;
goto out_promisc;
}
} else {
/* !(vsi->current_netdev_flags & IFF_ALLMULTI) */
- if (ice_vsi_has_non_zero_vlans(vsi))
- promisc_m = ICE_MCAST_VLAN_PROMISC_BITS;
- else
- promisc_m = ICE_MCAST_PROMISC_BITS;
-
- err = ice_clear_promisc(vsi, promisc_m);
+ err = ice_clear_promisc(vsi, ICE_MCAST_PROMISC_BITS);
if (err) {
- netdev_err(netdev, "Error clearing Multicast promiscuous mode on VSI %i\n",
- vsi->vsi_num);
vsi->current_netdev_flags |= IFF_ALLMULTI;
goto out_promisc;
}
@@ -3488,6 +3481,20 @@ ice_vlan_rx_add_vid(struct net_device *netdev, __be16 proto, u16 vid)
if (!vid)
return 0;
+ while (test_and_set_bit(ICE_CFG_BUSY, vsi->state))
+ usleep_range(1000, 2000);
+
+ /* Add multicast promisc rule for the VLAN ID to be added if
+ * all-multicast is currently enabled.
+ */
+ if (vsi->current_netdev_flags & IFF_ALLMULTI) {
+ ret = ice_fltr_set_vsi_promisc(&vsi->back->hw, vsi->idx,
+ ICE_MCAST_VLAN_PROMISC_BITS,
+ vid);
+ if (ret)
+ goto finish;
+ }
+
vlan_ops = ice_get_compat_vsi_vlan_ops(vsi);
/* Add a switch rule for this VLAN ID so its corresponding VLAN tagged
@@ -3495,8 +3502,23 @@ ice_vlan_rx_add_vid(struct net_device *netdev, __be16 proto, u16 vid)
*/
vlan = ICE_VLAN(be16_to_cpu(proto), vid, 0);
ret = vlan_ops->add_vlan(vsi, &vlan);
- if (!ret)
- set_bit(ICE_VSI_VLAN_FLTR_CHANGED, vsi->state);
+ if (ret)
+ goto finish;
+
+ /* If all-multicast is currently enabled and this VLAN ID is only one
+ * besides VLAN-0 we have to update look-up type of multicast promisc
+ * rule for VLAN-0 from ICE_SW_LKUP_PROMISC to ICE_SW_LKUP_PROMISC_VLAN.
+ */
+ if ((vsi->current_netdev_flags & IFF_ALLMULTI) &&
+ ice_vsi_num_non_zero_vlans(vsi) == 1) {
+ ice_fltr_clear_vsi_promisc(&vsi->back->hw, vsi->idx,
+ ICE_MCAST_PROMISC_BITS, 0);
+ ice_fltr_set_vsi_promisc(&vsi->back->hw, vsi->idx,
+ ICE_MCAST_VLAN_PROMISC_BITS, 0);
+ }
+
+finish:
+ clear_bit(ICE_CFG_BUSY, vsi->state);
return ret;
}
@@ -3522,6 +3544,9 @@ ice_vlan_rx_kill_vid(struct net_device *netdev, __be16 proto, u16 vid)
if (!vid)
return 0;
+ while (test_and_set_bit(ICE_CFG_BUSY, vsi->state))
+ usleep_range(1000, 2000);
+
vlan_ops = ice_get_compat_vsi_vlan_ops(vsi);
/* Make sure VLAN delete is successful before updating VLAN
@@ -3530,10 +3555,33 @@ ice_vlan_rx_kill_vid(struct net_device *netdev, __be16 proto, u16 vid)
vlan = ICE_VLAN(be16_to_cpu(proto), vid, 0);
ret = vlan_ops->del_vlan(vsi, &vlan);
if (ret)
- return ret;
+ goto finish;
- set_bit(ICE_VSI_VLAN_FLTR_CHANGED, vsi->state);
- return 0;
+ /* Remove multicast promisc rule for the removed VLAN ID if
+ * all-multicast is enabled.
+ */
+ if (vsi->current_netdev_flags & IFF_ALLMULTI)
+ ice_fltr_clear_vsi_promisc(&vsi->back->hw, vsi->idx,
+ ICE_MCAST_VLAN_PROMISC_BITS, vid);
+
+ if (!ice_vsi_has_non_zero_vlans(vsi)) {
+ /* Update look-up type of multicast promisc rule for VLAN 0
+ * from ICE_SW_LKUP_PROMISC_VLAN to ICE_SW_LKUP_PROMISC when
+ * all-multicast is enabled and VLAN 0 is the only VLAN rule.
+ */
+ if (vsi->current_netdev_flags & IFF_ALLMULTI) {
+ ice_fltr_clear_vsi_promisc(&vsi->back->hw, vsi->idx,
+ ICE_MCAST_VLAN_PROMISC_BITS,
+ 0);
+ ice_fltr_set_vsi_promisc(&vsi->back->hw, vsi->idx,
+ ICE_MCAST_PROMISC_BITS, 0);
+ }
+ }
+
+finish:
+ clear_bit(ICE_CFG_BUSY, vsi->state);
+
+ return ret;
}
/**
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 1273f89578f268ea705ddbad60c4bd2dcff80611 Mon Sep 17 00:00:00 2001
From: Ivan Vecera <ivecera(a)redhat.com>
Date: Thu, 31 Mar 2022 09:20:08 -0700
Subject: [PATCH] ice: Fix broken IFF_ALLMULTI handling
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 1273f89578f268ea705ddbad60c4bd2dcff80611 Mon Sep 17 00:00:00 2001
From: Ivan Vecera <ivecera(a)redhat.com>
Date: Thu, 31 Mar 2022 09:20:08 -0700
Subject: [PATCH] ice: Fix broken IFF_ALLMULTI handling
The patch below does not apply to the 5.16-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 1273f89578f268ea705ddbad60c4bd2dcff80611 Mon Sep 17 00:00:00 2001
From: Ivan Vecera <ivecera(a)redhat.com>
Date: Thu, 31 Mar 2022 09:20:08 -0700
Subject: [PATCH] ice: Fix broken IFF_ALLMULTI handling
The patch below does not apply to the 5.17-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 1273f89578f268ea705ddbad60c4bd2dcff80611 Mon Sep 17 00:00:00 2001
From: Ivan Vecera <ivecera(a)redhat.com>
Date: Thu, 31 Mar 2022 09:20:08 -0700
Subject: [PATCH] ice: Fix broken IFF_ALLMULTI handling
Handling of all-multicast flag and associated multicast promiscuous
mode is broken in ice driver. When an user switches allmulticast
flag on or off the driver checks whether any VLANs are configured
over the interface (except default VLAN 0).
If any extra VLANs are registered it enables multicast promiscuous
mode for all these VLANs (including default VLAN 0) using
ICE_SW_LKUP_PROMISC_VLAN look-up type. In this situation all
multicast packets tagged with known VLAN ID or untagged are received
and multicast packets tagged with unknown VLAN ID ignored.
If no extra VLANs are registered (so only VLAN 0 exists) it enables
multicast promiscuous mode for VLAN 0 and uses ICE_SW_LKUP_PROMISC
look-up type. In this situation any multicast packets including
tagged ones are received.
The driver handles IFF_ALLMULTI in ice_vsi_sync_fltr() this way:
ice_vsi_sync_fltr() {
...
if (changed_flags & IFF_ALLMULTI) {
if (netdev->flags & IFF_ALLMULTI) {
if (vsi->num_vlans > 1)
ice_set_promisc(..., ICE_MCAST_VLAN_PROMISC_BITS);
else
ice_set_promisc(..., ICE_MCAST_PROMISC_BITS);
} else {
if (vsi->num_vlans > 1)
ice_clear_promisc(..., ICE_MCAST_VLAN_PROMISC_BITS);
else
ice_clear_promisc(..., ICE_MCAST_PROMISC_BITS);
}
}
...
}
The code above depends on value vsi->num_vlan that specifies number
of VLANs configured over the interface (including VLAN 0) and
this is problem because that value is modified in NDO callbacks
ice_vlan_rx_add_vid() and ice_vlan_rx_kill_vid().
Scenario 1:
1. ip link set ens7f0 allmulticast on
2. ip link add vlan10 link ens7f0 type vlan id 10
3. ip link set ens7f0 allmulticast off
4. ip link set ens7f0 allmulticast on
[1] In this scenario IFF_ALLMULTI is enabled and the driver calls
ice_set_promisc(..., ICE_MCAST_PROMISC_BITS) that installs
multicast promisc rule with non-VLAN look-up type.
[2] Then VLAN with ID 10 is added and vsi->num_vlan incremented to 2
[3] Command switches IFF_ALLMULTI off and the driver calls
ice_clear_promisc(..., ICE_MCAST_VLAN_PROMISC_BITS) but this
call is effectively NOP because it looks for multicast promisc
rules for VLAN 0 and VLAN 10 with VLAN look-up type but no such
rules exist. So the all-multicast remains enabled silently
in hardware.
[4] Command tries to switch IFF_ALLMULTI on and the driver calls
ice_clear_promisc(..., ICE_MCAST_PROMISC_BITS) but this call
fails (-EEXIST) because non-VLAN multicast promisc rule already
exists.
Scenario 2:
1. ip link add vlan10 link ens7f0 type vlan id 10
2. ip link set ens7f0 allmulticast on
3. ip link add vlan20 link ens7f0 type vlan id 20
4. ip link del vlan10 ; ip link del vlan20
5. ip link set ens7f0 allmulticast off
[1] VLAN with ID 10 is added and vsi->num_vlan==2
[2] Command switches IFF_ALLMULTI on and driver installs multicast
promisc rules with VLAN look-up type for VLAN 0 and 10
[3] VLAN with ID 20 is added and vsi->num_vlan==3 but no multicast
promisc rules is added for this new VLAN so the interface does
not receive MC packets from VLAN 20
[4] Both VLANs are removed but multicast rule for VLAN 10 remains
installed so interface receives multicast packets from VLAN 10
[5] Command switches IFF_ALLMULTI off and because vsi->num_vlan is 1
the driver tries to remove multicast promisc rule for VLAN 0
with non-VLAN look-up that does not exist.
All-multicast looks disabled from user point of view but it
is partially enabled in HW (interface receives all multicast
packets either untagged or tagged with VLAN ID 10)
To resolve these issues the patch introduces these changes:
1. Adds handling for IFF_ALLMULTI to the ice_vlan_rx_add_vid() and
ice_vlan_rx_kill_vid() callbacks, so when a VLAN is added or removed
while IFF_ALLMULTI is enabled, an appropriate multicast promisc
rule for that VLAN ID is added or removed.
2. In ice_vlan_rx_add_vid(), when the first VLAN besides VLAN 0 is added
(vsi->num_vlan == 2) and IFF_ALLMULTI is enabled, the look-up type of
the existing multicast promisc rule for VLAN 0 is updated to
ICE_MCAST_VLAN_PROMISC_BITS.
3. In ice_vlan_rx_kill_vid(), when the last VLAN besides VLAN 0 is removed
(vsi->num_vlan == 1) and IFF_ALLMULTI is enabled, the look-up type of
the existing multicast promisc rule for VLAN 0 is updated to
ICE_MCAST_PROMISC_BITS.
4. Both ice_vlan_rx_{add,kill}_vid() have to run under ICE_CFG_BUSY
bit protection to avoid races with ice_vsi_sync_fltr(), which runs
in ice_service_task() context.
5. The ICE_VSI_VLAN_FLTR_CHANGED bit is useless and can be removed.
6. Error messages are added to the ice_fltr_*_vsi_promisc() helper
functions so that callers do not have to duplicate them.
7. Small improvements to increase readability.
Fixes: 5eda8afd6bcc ("ice: Add support for PF/VF promiscuous mode")
Signed-off-by: Ivan Vecera <ivecera(a)redhat.com>
Reviewed-by: Jacob Keller <jacob.e.keller(a)intel.com>
Signed-off-by: Alice Michael <alice.michael(a)intel.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
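For orientation before the full diff, here is a stand-alone user-space toy
model of the book-keeping described in points 1-3 above. Every name in it is
hypothetical (it is not driver code), and ice_vsi_sync_fltr()'s ALLMULTI
handling is reduced to a single helper; it replays Scenario 1 plus a VLAN
removal and shows that no stale promiscuous rule is left behind:

/* Toy model only - hypothetical names, not driver code. */
#include <stdbool.h>
#include <stdio.h>

#define MAX_VID 4096

struct toy_vsi {
    bool allmulti;              /* IFF_ALLMULTI state                 */
    bool vlan[MAX_VID];         /* which VLAN IDs are configured      */
    bool mc_rule;               /* non-VLAN multicast promisc rule    */
    bool mc_vlan_rule[MAX_VID]; /* per-VLAN multicast promisc rules   */
    int num_vlan;               /* configured VLANs including VLAN 0  */
};

/* IFF_ALLMULTI toggle, as ice_vsi_sync_fltr() would apply it */
static void toy_sync_allmulti(struct toy_vsi *v, bool on)
{
    v->allmulti = on;
    if (v->num_vlan > 1) {
        for (int vid = 0; vid < MAX_VID; vid++)
            if (v->vlan[vid])
                v->mc_vlan_rule[vid] = on;
    } else {
        v->mc_rule = on;
    }
}

/* .ndo_vlan_rx_add_vid path, changes #1 and #2 */
static void toy_add_vid(struct toy_vsi *v, int vid)
{
    if (v->allmulti)
        v->mc_vlan_rule[vid] = true;       /* #1: rule for new VLAN   */
    v->vlan[vid] = true;
    v->num_vlan++;
    if (v->allmulti && v->num_vlan == 2) { /* #2: first non-zero VLAN */
        v->mc_rule = false;                /* VLAN 0 switches to the  */
        v->mc_vlan_rule[0] = true;         /* VLAN look-up type       */
    }
}

/* .ndo_vlan_rx_kill_vid path, changes #1 and #3 */
static void toy_kill_vid(struct toy_vsi *v, int vid)
{
    v->vlan[vid] = false;
    v->num_vlan--;
    if (v->allmulti)
        v->mc_vlan_rule[vid] = false;      /* #1: drop rule for VLAN  */
    if (v->allmulti && v->num_vlan == 1) { /* #3: only VLAN 0 is left */
        v->mc_vlan_rule[0] = false;        /* VLAN 0 switches back to */
        v->mc_rule = true;                 /* the non-VLAN look-up    */
    }
}

int main(void)
{
    struct toy_vsi v = { .vlan = { [0] = true }, .num_vlan = 1 };

    toy_sync_allmulti(&v, true);   /* ip link set ens7f0 allmulticast on  */
    toy_add_vid(&v, 10);           /* ip link add vlan10 ... id 10        */
    toy_sync_allmulti(&v, false);  /* ip link set ens7f0 allmulticast off */
    printf("off:  mc=%d v0=%d v10=%d\n",
           v.mc_rule, v.mc_vlan_rule[0], v.mc_vlan_rule[10]);
    toy_sync_allmulti(&v, true);   /* ip link set ens7f0 allmulticast on  */
    printf("on:   mc=%d v0=%d v10=%d\n",
           v.mc_rule, v.mc_vlan_rule[0], v.mc_vlan_rule[10]);
    toy_kill_vid(&v, 10);          /* ip link del vlan10                  */
    printf("kill: mc=%d v0=%d v10=%d\n",
           v.mc_rule, v.mc_vlan_rule[0], v.mc_vlan_rule[10]);
    return 0;
}

With the reworked logic, "off:" prints all rules cleared (no stale rule as
in the broken scenarios above), "on:" prints VLAN-type rules for VLANs 0
and 10, and "kill:" prints a single non-VLAN rule again.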
diff --git a/drivers/net/ethernet/intel/ice/ice.h b/drivers/net/ethernet/intel/ice/ice.h
index d4f1874df7d0..26eaee0b6503 100644
--- a/drivers/net/ethernet/intel/ice/ice.h
+++ b/drivers/net/ethernet/intel/ice/ice.h
@@ -301,7 +301,6 @@ enum ice_vsi_state {
ICE_VSI_NETDEV_REGISTERED,
ICE_VSI_UMAC_FLTR_CHANGED,
ICE_VSI_MMAC_FLTR_CHANGED,
- ICE_VSI_VLAN_FLTR_CHANGED,
ICE_VSI_PROMISC_CHANGED,
ICE_VSI_STATE_NBITS /* must be last */
};
diff --git a/drivers/net/ethernet/intel/ice/ice_fltr.c b/drivers/net/ethernet/intel/ice/ice_fltr.c
index af57eb114966..85a94483c2ed 100644
--- a/drivers/net/ethernet/intel/ice/ice_fltr.c
+++ b/drivers/net/ethernet/intel/ice/ice_fltr.c
@@ -58,7 +58,16 @@ int
ice_fltr_set_vlan_vsi_promisc(struct ice_hw *hw, struct ice_vsi *vsi,
u8 promisc_mask)
{
- return ice_set_vlan_vsi_promisc(hw, vsi->idx, promisc_mask, false);
+ struct ice_pf *pf = hw->back;
+ int result;
+
+ result = ice_set_vlan_vsi_promisc(hw, vsi->idx, promisc_mask, false);
+ if (result)
+ dev_err(ice_pf_to_dev(pf),
+ "Error setting promisc mode on VSI %i (rc=%d)\n",
+ vsi->vsi_num, result);
+
+ return result;
}
/**
@@ -73,7 +82,16 @@ int
ice_fltr_clear_vlan_vsi_promisc(struct ice_hw *hw, struct ice_vsi *vsi,
u8 promisc_mask)
{
- return ice_set_vlan_vsi_promisc(hw, vsi->idx, promisc_mask, true);
+ struct ice_pf *pf = hw->back;
+ int result;
+
+ result = ice_set_vlan_vsi_promisc(hw, vsi->idx, promisc_mask, true);
+ if (result)
+ dev_err(ice_pf_to_dev(pf),
+ "Error clearing promisc mode on VSI %i (rc=%d)\n",
+ vsi->vsi_num, result);
+
+ return result;
}
/**
@@ -87,7 +105,16 @@ int
ice_fltr_clear_vsi_promisc(struct ice_hw *hw, u16 vsi_handle, u8 promisc_mask,
u16 vid)
{
- return ice_clear_vsi_promisc(hw, vsi_handle, promisc_mask, vid);
+ struct ice_pf *pf = hw->back;
+ int result;
+
+ result = ice_clear_vsi_promisc(hw, vsi_handle, promisc_mask, vid);
+ if (result)
+ dev_err(ice_pf_to_dev(pf),
+ "Error clearing promisc mode on VSI %i for VID %u (rc=%d)\n",
+ ice_get_hw_vsi_num(hw, vsi_handle), vid, result);
+
+ return result;
}
/**
@@ -101,7 +128,16 @@ int
ice_fltr_set_vsi_promisc(struct ice_hw *hw, u16 vsi_handle, u8 promisc_mask,
u16 vid)
{
- return ice_set_vsi_promisc(hw, vsi_handle, promisc_mask, vid);
+ struct ice_pf *pf = hw->back;
+ int result;
+
+ result = ice_set_vsi_promisc(hw, vsi_handle, promisc_mask, vid);
+ if (result)
+ dev_err(ice_pf_to_dev(pf),
+ "Error setting promisc mode on VSI %i for VID %u (rc=%d)\n",
+ ice_get_hw_vsi_num(hw, vsi_handle), vid, result);
+
+ return result;
}
/**
diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
index d755ce07869f..1d2ca39add95 100644
--- a/drivers/net/ethernet/intel/ice/ice_main.c
+++ b/drivers/net/ethernet/intel/ice/ice_main.c
@@ -243,8 +243,7 @@ static int ice_add_mac_to_unsync_list(struct net_device *netdev, const u8 *addr)
static bool ice_vsi_fltr_changed(struct ice_vsi *vsi)
{
return test_bit(ICE_VSI_UMAC_FLTR_CHANGED, vsi->state) ||
- test_bit(ICE_VSI_MMAC_FLTR_CHANGED, vsi->state) ||
- test_bit(ICE_VSI_VLAN_FLTR_CHANGED, vsi->state);
+ test_bit(ICE_VSI_MMAC_FLTR_CHANGED, vsi->state);
}
/**
@@ -260,10 +259,15 @@ static int ice_set_promisc(struct ice_vsi *vsi, u8 promisc_m)
if (vsi->type != ICE_VSI_PF)
return 0;
- if (ice_vsi_has_non_zero_vlans(vsi))
- status = ice_fltr_set_vlan_vsi_promisc(&vsi->back->hw, vsi, promisc_m);
- else
- status = ice_fltr_set_vsi_promisc(&vsi->back->hw, vsi->idx, promisc_m, 0);
+ if (ice_vsi_has_non_zero_vlans(vsi)) {
+ promisc_m |= (ICE_PROMISC_VLAN_RX | ICE_PROMISC_VLAN_TX);
+ status = ice_fltr_set_vlan_vsi_promisc(&vsi->back->hw, vsi,
+ promisc_m);
+ } else {
+ status = ice_fltr_set_vsi_promisc(&vsi->back->hw, vsi->idx,
+ promisc_m, 0);
+ }
+
return status;
}
@@ -280,10 +284,15 @@ static int ice_clear_promisc(struct ice_vsi *vsi, u8 promisc_m)
if (vsi->type != ICE_VSI_PF)
return 0;
- if (ice_vsi_has_non_zero_vlans(vsi))
- status = ice_fltr_clear_vlan_vsi_promisc(&vsi->back->hw, vsi, promisc_m);
- else
- status = ice_fltr_clear_vsi_promisc(&vsi->back->hw, vsi->idx, promisc_m, 0);
+ if (ice_vsi_has_non_zero_vlans(vsi)) {
+ promisc_m |= (ICE_PROMISC_VLAN_RX | ICE_PROMISC_VLAN_TX);
+ status = ice_fltr_clear_vlan_vsi_promisc(&vsi->back->hw, vsi,
+ promisc_m);
+ } else {
+ status = ice_fltr_clear_vsi_promisc(&vsi->back->hw, vsi->idx,
+ promisc_m, 0);
+ }
+
return status;
}
@@ -302,7 +311,6 @@ static int ice_vsi_sync_fltr(struct ice_vsi *vsi)
struct ice_pf *pf = vsi->back;
struct ice_hw *hw = &pf->hw;
u32 changed_flags = 0;
- u8 promisc_m;
int err;
if (!vsi->netdev)
@@ -320,7 +328,6 @@ static int ice_vsi_sync_fltr(struct ice_vsi *vsi)
if (ice_vsi_fltr_changed(vsi)) {
clear_bit(ICE_VSI_UMAC_FLTR_CHANGED, vsi->state);
clear_bit(ICE_VSI_MMAC_FLTR_CHANGED, vsi->state);
- clear_bit(ICE_VSI_VLAN_FLTR_CHANGED, vsi->state);
/* grab the netdev's addr_list_lock */
netif_addr_lock_bh(netdev);
@@ -369,29 +376,15 @@ static int ice_vsi_sync_fltr(struct ice_vsi *vsi)
/* check for changes in promiscuous modes */
if (changed_flags & IFF_ALLMULTI) {
if (vsi->current_netdev_flags & IFF_ALLMULTI) {
- if (ice_vsi_has_non_zero_vlans(vsi))
- promisc_m = ICE_MCAST_VLAN_PROMISC_BITS;
- else
- promisc_m = ICE_MCAST_PROMISC_BITS;
-
- err = ice_set_promisc(vsi, promisc_m);
+ err = ice_set_promisc(vsi, ICE_MCAST_PROMISC_BITS);
if (err) {
- netdev_err(netdev, "Error setting Multicast promiscuous mode on VSI %i\n",
- vsi->vsi_num);
vsi->current_netdev_flags &= ~IFF_ALLMULTI;
goto out_promisc;
}
} else {
/* !(vsi->current_netdev_flags & IFF_ALLMULTI) */
- if (ice_vsi_has_non_zero_vlans(vsi))
- promisc_m = ICE_MCAST_VLAN_PROMISC_BITS;
- else
- promisc_m = ICE_MCAST_PROMISC_BITS;
-
- err = ice_clear_promisc(vsi, promisc_m);
+ err = ice_clear_promisc(vsi, ICE_MCAST_PROMISC_BITS);
if (err) {
- netdev_err(netdev, "Error clearing Multicast promiscuous mode on VSI %i\n",
- vsi->vsi_num);
vsi->current_netdev_flags |= IFF_ALLMULTI;
goto out_promisc;
}
@@ -3488,6 +3481,20 @@ ice_vlan_rx_add_vid(struct net_device *netdev, __be16 proto, u16 vid)
if (!vid)
return 0;
+ while (test_and_set_bit(ICE_CFG_BUSY, vsi->state))
+ usleep_range(1000, 2000);
+
+ /* Add multicast promisc rule for the VLAN ID to be added if
+ * all-multicast is currently enabled.
+ */
+ if (vsi->current_netdev_flags & IFF_ALLMULTI) {
+ ret = ice_fltr_set_vsi_promisc(&vsi->back->hw, vsi->idx,
+ ICE_MCAST_VLAN_PROMISC_BITS,
+ vid);
+ if (ret)
+ goto finish;
+ }
+
vlan_ops = ice_get_compat_vsi_vlan_ops(vsi);
/* Add a switch rule for this VLAN ID so its corresponding VLAN tagged
@@ -3495,8 +3502,23 @@ ice_vlan_rx_add_vid(struct net_device *netdev, __be16 proto, u16 vid)
*/
vlan = ICE_VLAN(be16_to_cpu(proto), vid, 0);
ret = vlan_ops->add_vlan(vsi, &vlan);
- if (!ret)
- set_bit(ICE_VSI_VLAN_FLTR_CHANGED, vsi->state);
+ if (ret)
+ goto finish;
+
+ /* If all-multicast is currently enabled and this VLAN ID is only one
+ * besides VLAN-0 we have to update look-up type of multicast promisc
+ * rule for VLAN-0 from ICE_SW_LKUP_PROMISC to ICE_SW_LKUP_PROMISC_VLAN.
+ */
+ if ((vsi->current_netdev_flags & IFF_ALLMULTI) &&
+ ice_vsi_num_non_zero_vlans(vsi) == 1) {
+ ice_fltr_clear_vsi_promisc(&vsi->back->hw, vsi->idx,
+ ICE_MCAST_PROMISC_BITS, 0);
+ ice_fltr_set_vsi_promisc(&vsi->back->hw, vsi->idx,
+ ICE_MCAST_VLAN_PROMISC_BITS, 0);
+ }
+
+finish:
+ clear_bit(ICE_CFG_BUSY, vsi->state);
return ret;
}
@@ -3522,6 +3544,9 @@ ice_vlan_rx_kill_vid(struct net_device *netdev, __be16 proto, u16 vid)
if (!vid)
return 0;
+ while (test_and_set_bit(ICE_CFG_BUSY, vsi->state))
+ usleep_range(1000, 2000);
+
vlan_ops = ice_get_compat_vsi_vlan_ops(vsi);
/* Make sure VLAN delete is successful before updating VLAN
@@ -3530,10 +3555,33 @@ ice_vlan_rx_kill_vid(struct net_device *netdev, __be16 proto, u16 vid)
vlan = ICE_VLAN(be16_to_cpu(proto), vid, 0);
ret = vlan_ops->del_vlan(vsi, &vlan);
if (ret)
- return ret;
+ goto finish;
- set_bit(ICE_VSI_VLAN_FLTR_CHANGED, vsi->state);
- return 0;
+ /* Remove multicast promisc rule for the removed VLAN ID if
+ * all-multicast is enabled.
+ */
+ if (vsi->current_netdev_flags & IFF_ALLMULTI)
+ ice_fltr_clear_vsi_promisc(&vsi->back->hw, vsi->idx,
+ ICE_MCAST_VLAN_PROMISC_BITS, vid);
+
+ if (!ice_vsi_has_non_zero_vlans(vsi)) {
+ /* Update look-up type of multicast promisc rule for VLAN 0
+ * from ICE_SW_LKUP_PROMISC_VLAN to ICE_SW_LKUP_PROMISC when
+ * all-multicast is enabled and VLAN 0 is the only VLAN rule.
+ */
+ if (vsi->current_netdev_flags & IFF_ALLMULTI) {
+ ice_fltr_clear_vsi_promisc(&vsi->back->hw, vsi->idx,
+ ICE_MCAST_VLAN_PROMISC_BITS,
+ 0);
+ ice_fltr_set_vsi_promisc(&vsi->back->hw, vsi->idx,
+ ICE_MCAST_PROMISC_BITS, 0);
+ }
+ }
+
+finish:
+ clear_bit(ICE_CFG_BUSY, vsi->state);
+
+ return ret;
}
/**
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 96492a6c558acb56124844d1409d9ef8624a0322 Mon Sep 17 00:00:00 2001
From: Chengming Zhou <zhouchengming(a)bytedance.com>
Date: Tue, 29 Mar 2022 23:45:22 +0800
Subject: [PATCH] perf/core: Fix perf_cgroup_switch()
There is a race problem that can trigger WARN_ON_ONCE(cpuctx->cgrp)
in perf_cgroup_switch().
CPU1                                            CPU2
perf_cgroup_sched_out(prev, next)
  cgrp1 = perf_cgroup_from_task(prev)
  cgrp2 = perf_cgroup_from_task(next)
  if (cgrp1 != cgrp2)
    perf_cgroup_switch(prev, PERF_CGROUP_SWOUT)
                                                cgroup_migrate_execute()
                                                  task->cgroups = ?
                                                  perf_cgroup_attach()
                                                    task_function_call(task, __perf_cgroup_move)
perf_cgroup_sched_in(prev, next)
  cgrp1 = perf_cgroup_from_task(prev)
  cgrp2 = perf_cgroup_from_task(next)
  if (cgrp1 != cgrp2)
    perf_cgroup_switch(next, PERF_CGROUP_SWIN)
                                                __perf_cgroup_move()
                                                  perf_cgroup_switch(task, PERF_CGROUP_SWOUT | PERF_CGROUP_SWIN)
Commit a8d757ef076f ("perf events: Fix slow and broken cgroup
context switch code") wants to skip perf_cgroup_switch() when the
perf_cgroups of "prev" and "next" are the same.
But task->cgroups can change concurrently with context_switch()
in cgroup_migrate_execute(). If cgrp1 == cgrp2 in sched_out(),
cpuctx won't do sched_out. If the change to task->cgroups then causes
cgrp1 != cgrp2 in sched_in(), cpuctx will do sched_in and trigger
WARN_ON_ONCE(cpuctx->cgrp).
Even though __perf_cgroup_move() is synchronized because the context
switch disables interrupts, context_switch() can still observe
task->cgroups changing in the middle, since task->cgroups is changed
before the IPI is sent.
So we have to combine perf_cgroup_sched_in() and perf_cgroup_sched_out(),
unifying them into perf_cgroup_switch(), to fix the inconsistency between
perf_cgroup_sched_out() and perf_cgroup_sched_in().
But we can't just compare prev->cgroups with next->cgroups to decide
whether to skip cpuctx sched_out/in, since prev->cgroups is changing
too. For example:
CPU1                                            CPU2
cgroup_migrate_execute()
  prev->cgroups = ?
  perf_cgroup_attach()
    task_function_call(task, __perf_cgroup_move)
                                                perf_cgroup_switch(task)
                                                  cgrp1 = perf_cgroup_from_task(prev)
                                                  cgrp2 = perf_cgroup_from_task(next)
                                                  if (cgrp1 != cgrp2)
                                                    cpuctx sched_out/in ...
task_function_call() will return -ESRCH
In the above example, the change to prev->cgroups causes (cgrp1 == cgrp2)
to be true, so cpuctx sched_out/in is skipped. Later, task_function_call()
returns -ESRCH since the prev task isn't running on the CPU anymore.
So we would leave perf_events of the old prev->cgroups still scheduled on
the CPU, which is wrong.
The solution is to compare cpuctx->cgrp with the next task's perf_cgroup.
Since cpuctx->cgrp can only be changed on the local CPU, and we have IRQs
disabled, we can read cpuctx->cgrp for the comparison without holding the
ctx lock.
Fixes: a8d757ef076f ("perf events: Fix slow and broken cgroup context switch code")
Signed-off-by: Chengming Zhou <zhouchengming(a)bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Link: https://lore.kernel.org/r/20220329154523.86438-4-zhouchengming@bytedance.com
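Condensed from the diff below, the essential shape of the reworked
perf_cgroup_switch() is: read the task's perf_cgroup once and, for each
cgroup cpuctx on this CPU, skip the reschedule entirely when cpuctx->cgrp
already matches. The snippet omits the IRQ save/restore, perf_ctx_lock/unlock
and PMU disable/enable around the loop body and is not a stand-alone
compilable unit:

static void perf_cgroup_switch(struct task_struct *task)
{
    struct perf_cgroup *cgrp = perf_cgroup_from_task(task, NULL);
    struct perf_cpu_context *cpuctx, *tmp;
    struct list_head *list = this_cpu_ptr(&cgrp_cpuctx_list);

    list_for_each_entry_safe(cpuctx, tmp, list, cgrp_cpuctx_entry) {
        /* cpuctx->cgrp only changes on the local CPU with IRQs off */
        if (READ_ONCE(cpuctx->cgrp) == cgrp)
            continue;           /* already on this cgroup, nothing to do */

        cpu_ctx_sched_out(cpuctx, EVENT_ALL);  /* out with the old events */
        cpuctx->cgrp = cgrp;                   /* set before sched_in     */
        cpu_ctx_sched_in(cpuctx, EVENT_ALL);   /* in with the new events  */
    }
}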
diff --git a/kernel/events/core.c b/kernel/events/core.c
index a08fb92b3934..bdeb41fe7f15 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -824,17 +824,12 @@ perf_cgroup_set_timestamp(struct perf_cpu_context *cpuctx)
static DEFINE_PER_CPU(struct list_head, cgrp_cpuctx_list);
-#define PERF_CGROUP_SWOUT 0x1 /* cgroup switch out every event */
-#define PERF_CGROUP_SWIN 0x2 /* cgroup switch in events based on task */
-
/*
* reschedule events based on the cgroup constraint of task.
- *
- * mode SWOUT : schedule out everything
- * mode SWIN : schedule in based on cgroup for next
*/
-static void perf_cgroup_switch(struct task_struct *task, int mode)
+static void perf_cgroup_switch(struct task_struct *task)
{
+ struct perf_cgroup *cgrp;
struct perf_cpu_context *cpuctx, *tmp;
struct list_head *list;
unsigned long flags;
@@ -845,35 +840,31 @@ static void perf_cgroup_switch(struct task_struct *task, int mode)
*/
local_irq_save(flags);
+ cgrp = perf_cgroup_from_task(task, NULL);
+
list = this_cpu_ptr(&cgrp_cpuctx_list);
list_for_each_entry_safe(cpuctx, tmp, list, cgrp_cpuctx_entry) {
WARN_ON_ONCE(cpuctx->ctx.nr_cgroups == 0);
+ if (READ_ONCE(cpuctx->cgrp) == cgrp)
+ continue;
perf_ctx_lock(cpuctx, cpuctx->task_ctx);
perf_pmu_disable(cpuctx->ctx.pmu);
- if (mode & PERF_CGROUP_SWOUT) {
- cpu_ctx_sched_out(cpuctx, EVENT_ALL);
- /*
- * must not be done before ctxswout due
- * to event_filter_match() in event_sched_out()
- */
- cpuctx->cgrp = NULL;
- }
+ cpu_ctx_sched_out(cpuctx, EVENT_ALL);
+ /*
+ * must not be done before ctxswout due
+ * to update_cgrp_time_from_cpuctx() in
+ * ctx_sched_out()
+ */
+ cpuctx->cgrp = cgrp;
+ /*
+ * set cgrp before ctxsw in to allow
+ * perf_cgroup_set_timestamp() in ctx_sched_in()
+ * to not have to pass task around
+ */
+ cpu_ctx_sched_in(cpuctx, EVENT_ALL);
- if (mode & PERF_CGROUP_SWIN) {
- WARN_ON_ONCE(cpuctx->cgrp);
- /*
- * set cgrp before ctxsw in to allow
- * perf_cgroup_set_timestamp() in ctx_sched_in()
- * to not have to pass task around
- * we pass the cpuctx->ctx to perf_cgroup_from_task()
- * because cgorup events are only per-cpu
- */
- cpuctx->cgrp = perf_cgroup_from_task(task,
- &cpuctx->ctx);
- cpu_ctx_sched_in(cpuctx, EVENT_ALL);
- }
perf_pmu_enable(cpuctx->ctx.pmu);
perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
}
@@ -881,58 +872,6 @@ static void perf_cgroup_switch(struct task_struct *task, int mode)
local_irq_restore(flags);
}
-static inline void perf_cgroup_sched_out(struct task_struct *task,
- struct task_struct *next)
-{
- struct perf_cgroup *cgrp1;
- struct perf_cgroup *cgrp2 = NULL;
-
- rcu_read_lock();
- /*
- * we come here when we know perf_cgroup_events > 0
- * we do not need to pass the ctx here because we know
- * we are holding the rcu lock
- */
- cgrp1 = perf_cgroup_from_task(task, NULL);
- cgrp2 = perf_cgroup_from_task(next, NULL);
-
- /*
- * only schedule out current cgroup events if we know
- * that we are switching to a different cgroup. Otherwise,
- * do no touch the cgroup events.
- */
- if (cgrp1 != cgrp2)
- perf_cgroup_switch(task, PERF_CGROUP_SWOUT);
-
- rcu_read_unlock();
-}
-
-static inline void perf_cgroup_sched_in(struct task_struct *prev,
- struct task_struct *task)
-{
- struct perf_cgroup *cgrp1;
- struct perf_cgroup *cgrp2 = NULL;
-
- rcu_read_lock();
- /*
- * we come here when we know perf_cgroup_events > 0
- * we do not need to pass the ctx here because we know
- * we are holding the rcu lock
- */
- cgrp1 = perf_cgroup_from_task(task, NULL);
- cgrp2 = perf_cgroup_from_task(prev, NULL);
-
- /*
- * only need to schedule in cgroup events if we are changing
- * cgroup during ctxsw. Cgroup events were not scheduled
- * out of ctxsw out if that was not the case.
- */
- if (cgrp1 != cgrp2)
- perf_cgroup_switch(task, PERF_CGROUP_SWIN);
-
- rcu_read_unlock();
-}
-
static int perf_cgroup_ensure_storage(struct perf_event *event,
struct cgroup_subsys_state *css)
{
@@ -1096,16 +1035,6 @@ static inline void update_cgrp_time_from_cpuctx(struct perf_cpu_context *cpuctx,
{
}
-static inline void perf_cgroup_sched_out(struct task_struct *task,
- struct task_struct *next)
-{
-}
-
-static inline void perf_cgroup_sched_in(struct task_struct *prev,
- struct task_struct *task)
-{
-}
-
static inline int perf_cgroup_connect(pid_t pid, struct perf_event *event,
struct perf_event_attr *attr,
struct perf_event *group_leader)
@@ -1118,11 +1047,6 @@ perf_cgroup_set_timestamp(struct perf_cpu_context *cpuctx)
{
}
-static inline void
-perf_cgroup_switch(struct task_struct *task, struct task_struct *next)
-{
-}
-
static inline u64 perf_cgroup_event_time(struct perf_event *event)
{
return 0;
@@ -1142,6 +1066,10 @@ static inline void
perf_cgroup_event_disable(struct perf_event *event, struct perf_event_context *ctx)
{
}
+
+static void perf_cgroup_switch(struct task_struct *task)
+{
+}
#endif
/*
@@ -3661,7 +3589,7 @@ void __perf_event_task_sched_out(struct task_struct *task,
* cgroup event are system-wide mode only
*/
if (atomic_read(this_cpu_ptr(&perf_cgroup_events)))
- perf_cgroup_sched_out(task, next);
+ perf_cgroup_switch(next);
}
/*
@@ -3975,16 +3903,6 @@ void __perf_event_task_sched_in(struct task_struct *prev,
struct perf_event_context *ctx;
int ctxn;
- /*
- * If cgroup events exist on this CPU, then we need to check if we have
- * to switch in PMU state; cgroup event are system-wide mode only.
- *
- * Since cgroup events are CPU events, we must schedule these in before
- * we schedule in the task events.
- */
- if (atomic_read(this_cpu_ptr(&perf_cgroup_events)))
- perf_cgroup_sched_in(prev, task);
-
for_each_task_context_nr(ctxn) {
ctx = task->perf_event_ctxp[ctxn];
if (likely(!ctx))
@@ -13556,7 +13474,7 @@ static int __perf_cgroup_move(void *info)
{
struct task_struct *task = info;
rcu_read_lock();
- perf_cgroup_switch(task, PERF_CGROUP_SWOUT | PERF_CGROUP_SWIN);
+ perf_cgroup_switch(task);
rcu_read_unlock();
return 0;
}
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 96492a6c558acb56124844d1409d9ef8624a0322 Mon Sep 17 00:00:00 2001
From: Chengming Zhou <zhouchengming(a)bytedance.com>
Date: Tue, 29 Mar 2022 23:45:22 +0800
Subject: [PATCH] perf/core: Fix perf_cgroup_switch()
There is a race problem that can trigger WARN_ON_ONCE(cpuctx->cgrp)
in perf_cgroup_switch().
CPU1 CPU2
perf_cgroup_sched_out(prev, next)
cgrp1 = perf_cgroup_from_task(prev)
cgrp2 = perf_cgroup_from_task(next)
if (cgrp1 != cgrp2)
perf_cgroup_switch(prev, PERF_CGROUP_SWOUT)
cgroup_migrate_execute()
task->cgroups = ?
perf_cgroup_attach()
task_function_call(task, __perf_cgroup_move)
perf_cgroup_sched_in(prev, next)
cgrp1 = perf_cgroup_from_task(prev)
cgrp2 = perf_cgroup_from_task(next)
if (cgrp1 != cgrp2)
perf_cgroup_switch(next, PERF_CGROUP_SWIN)
__perf_cgroup_move()
perf_cgroup_switch(task, PERF_CGROUP_SWOUT | PERF_CGROUP_SWIN)
The commit a8d757ef076f ("perf events: Fix slow and broken cgroup
context switch code") want to skip perf_cgroup_switch() when the
perf_cgroup of "prev" and "next" are the same.
But task->cgroups can change concurrently with context_switch() in
cgroup_migrate_execute(). If cgrp1 == cgrp2 in sched_out(), cpuctx
won't do sched_out. If task->cgroups then changes so that cgrp1 != cgrp2
in sched_in(), cpuctx will do sched_in and trigger
WARN_ON_ONCE(cpuctx->cgrp).
Even though __perf_cgroup_move() is serialized against the context
switch by the disabled interrupts, context_switch() can still see
task->cgroups changing in the middle, since task->cgroups is updated
before the IPI is sent.
So we have to combine perf_cgroup_sched_in() into perf_cgroup_sched_out(),
unified into perf_cgroup_switch(), to fix the inconsistency between
perf_cgroup_sched_out() and perf_cgroup_sched_in().
But we can't just compare prev->cgroups with next->cgroups to decide
whether to skip cpuctx sched_out/in since the prev->cgroups is changing
too. For example:
        CPU1                                CPU2
                                            cgroup_migrate_execute()
                                              prev->cgroups = ?
                                              perf_cgroup_attach()
                                                task_function_call(task, __perf_cgroup_move)
perf_cgroup_switch(task)
  cgrp1 = perf_cgroup_from_task(prev)
  cgrp2 = perf_cgroup_from_task(next)
  if (cgrp1 != cgrp2)
    cpuctx sched_out/in ...
                                                task_function_call() will return -ESRCH
In the above example, the change to prev->cgroups causes (cgrp1 == cgrp2)
to be true, so cpuctx sched_out/in is skipped. Later, task_function_call()
returns -ESRCH since the prev task is no longer running on the CPU. As a
result, the perf events of the old prev->cgroups are left scheduled on
the CPU, which is wrong.
The solution is to compare cpuctx->cgrp with the next task's perf_cgroup.
Since cpuctx->cgrp can only be changed on the local CPU, and we have irqs
disabled, we can read cpuctx->cgrp for the comparison without holding the
ctx lock.
Fixes: a8d757ef076f ("perf events: Fix slow and broken cgroup context switch code")
Signed-off-by: Chengming Zhou <zhouchengming(a)bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Link: https://lore.kernel.org/r/20220329154523.86438-4-zhouchengming@bytedance.com
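For readers who do not want to apply the diff below mentally, the post-patch
core of perf_cgroup_switch() reduces to roughly the following (a condensed
sketch of the "+" lines from the diff, comments trimmed; not a verbatim copy
of kernel/events/core.c):
static void perf_cgroup_switch(struct task_struct *task)
{
	struct perf_cgroup *cgrp;
	struct perf_cpu_context *cpuctx, *tmp;
	unsigned long flags;

	local_irq_save(flags);
	cgrp = perf_cgroup_from_task(task, NULL);

	list_for_each_entry_safe(cpuctx, tmp, this_cpu_ptr(&cgrp_cpuctx_list),
				 cgrp_cpuctx_entry) {
		/* cpuctx already runs this cgroup: nothing to switch */
		if (READ_ONCE(cpuctx->cgrp) == cgrp)
			continue;

		perf_ctx_lock(cpuctx, cpuctx->task_ctx);
		perf_pmu_disable(cpuctx->ctx.pmu);

		cpu_ctx_sched_out(cpuctx, EVENT_ALL);
		cpuctx->cgrp = cgrp;
		cpu_ctx_sched_in(cpuctx, EVENT_ALL);

		perf_pmu_enable(cpuctx->ctx.pmu);
		perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
	}
	local_irq_restore(flags);
}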
diff --git a/kernel/events/core.c b/kernel/events/core.c
index a08fb92b3934..bdeb41fe7f15 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -824,17 +824,12 @@ perf_cgroup_set_timestamp(struct perf_cpu_context *cpuctx)
static DEFINE_PER_CPU(struct list_head, cgrp_cpuctx_list);
-#define PERF_CGROUP_SWOUT 0x1 /* cgroup switch out every event */
-#define PERF_CGROUP_SWIN 0x2 /* cgroup switch in events based on task */
-
/*
* reschedule events based on the cgroup constraint of task.
- *
- * mode SWOUT : schedule out everything
- * mode SWIN : schedule in based on cgroup for next
*/
-static void perf_cgroup_switch(struct task_struct *task, int mode)
+static void perf_cgroup_switch(struct task_struct *task)
{
+ struct perf_cgroup *cgrp;
struct perf_cpu_context *cpuctx, *tmp;
struct list_head *list;
unsigned long flags;
@@ -845,35 +840,31 @@ static void perf_cgroup_switch(struct task_struct *task, int mode)
*/
local_irq_save(flags);
+ cgrp = perf_cgroup_from_task(task, NULL);
+
list = this_cpu_ptr(&cgrp_cpuctx_list);
list_for_each_entry_safe(cpuctx, tmp, list, cgrp_cpuctx_entry) {
WARN_ON_ONCE(cpuctx->ctx.nr_cgroups == 0);
+ if (READ_ONCE(cpuctx->cgrp) == cgrp)
+ continue;
perf_ctx_lock(cpuctx, cpuctx->task_ctx);
perf_pmu_disable(cpuctx->ctx.pmu);
- if (mode & PERF_CGROUP_SWOUT) {
- cpu_ctx_sched_out(cpuctx, EVENT_ALL);
- /*
- * must not be done before ctxswout due
- * to event_filter_match() in event_sched_out()
- */
- cpuctx->cgrp = NULL;
- }
+ cpu_ctx_sched_out(cpuctx, EVENT_ALL);
+ /*
+ * must not be done before ctxswout due
+ * to update_cgrp_time_from_cpuctx() in
+ * ctx_sched_out()
+ */
+ cpuctx->cgrp = cgrp;
+ /*
+ * set cgrp before ctxsw in to allow
+ * perf_cgroup_set_timestamp() in ctx_sched_in()
+ * to not have to pass task around
+ */
+ cpu_ctx_sched_in(cpuctx, EVENT_ALL);
- if (mode & PERF_CGROUP_SWIN) {
- WARN_ON_ONCE(cpuctx->cgrp);
- /*
- * set cgrp before ctxsw in to allow
- * perf_cgroup_set_timestamp() in ctx_sched_in()
- * to not have to pass task around
- * we pass the cpuctx->ctx to perf_cgroup_from_task()
- * because cgorup events are only per-cpu
- */
- cpuctx->cgrp = perf_cgroup_from_task(task,
- &cpuctx->ctx);
- cpu_ctx_sched_in(cpuctx, EVENT_ALL);
- }
perf_pmu_enable(cpuctx->ctx.pmu);
perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
}
@@ -881,58 +872,6 @@ static void perf_cgroup_switch(struct task_struct *task, int mode)
local_irq_restore(flags);
}
-static inline void perf_cgroup_sched_out(struct task_struct *task,
- struct task_struct *next)
-{
- struct perf_cgroup *cgrp1;
- struct perf_cgroup *cgrp2 = NULL;
-
- rcu_read_lock();
- /*
- * we come here when we know perf_cgroup_events > 0
- * we do not need to pass the ctx here because we know
- * we are holding the rcu lock
- */
- cgrp1 = perf_cgroup_from_task(task, NULL);
- cgrp2 = perf_cgroup_from_task(next, NULL);
-
- /*
- * only schedule out current cgroup events if we know
- * that we are switching to a different cgroup. Otherwise,
- * do no touch the cgroup events.
- */
- if (cgrp1 != cgrp2)
- perf_cgroup_switch(task, PERF_CGROUP_SWOUT);
-
- rcu_read_unlock();
-}
-
-static inline void perf_cgroup_sched_in(struct task_struct *prev,
- struct task_struct *task)
-{
- struct perf_cgroup *cgrp1;
- struct perf_cgroup *cgrp2 = NULL;
-
- rcu_read_lock();
- /*
- * we come here when we know perf_cgroup_events > 0
- * we do not need to pass the ctx here because we know
- * we are holding the rcu lock
- */
- cgrp1 = perf_cgroup_from_task(task, NULL);
- cgrp2 = perf_cgroup_from_task(prev, NULL);
-
- /*
- * only need to schedule in cgroup events if we are changing
- * cgroup during ctxsw. Cgroup events were not scheduled
- * out of ctxsw out if that was not the case.
- */
- if (cgrp1 != cgrp2)
- perf_cgroup_switch(task, PERF_CGROUP_SWIN);
-
- rcu_read_unlock();
-}
-
static int perf_cgroup_ensure_storage(struct perf_event *event,
struct cgroup_subsys_state *css)
{
@@ -1096,16 +1035,6 @@ static inline void update_cgrp_time_from_cpuctx(struct perf_cpu_context *cpuctx,
{
}
-static inline void perf_cgroup_sched_out(struct task_struct *task,
- struct task_struct *next)
-{
-}
-
-static inline void perf_cgroup_sched_in(struct task_struct *prev,
- struct task_struct *task)
-{
-}
-
static inline int perf_cgroup_connect(pid_t pid, struct perf_event *event,
struct perf_event_attr *attr,
struct perf_event *group_leader)
@@ -1118,11 +1047,6 @@ perf_cgroup_set_timestamp(struct perf_cpu_context *cpuctx)
{
}
-static inline void
-perf_cgroup_switch(struct task_struct *task, struct task_struct *next)
-{
-}
-
static inline u64 perf_cgroup_event_time(struct perf_event *event)
{
return 0;
@@ -1142,6 +1066,10 @@ static inline void
perf_cgroup_event_disable(struct perf_event *event, struct perf_event_context *ctx)
{
}
+
+static void perf_cgroup_switch(struct task_struct *task)
+{
+}
#endif
/*
@@ -3661,7 +3589,7 @@ void __perf_event_task_sched_out(struct task_struct *task,
* cgroup event are system-wide mode only
*/
if (atomic_read(this_cpu_ptr(&perf_cgroup_events)))
- perf_cgroup_sched_out(task, next);
+ perf_cgroup_switch(next);
}
/*
@@ -3975,16 +3903,6 @@ void __perf_event_task_sched_in(struct task_struct *prev,
struct perf_event_context *ctx;
int ctxn;
- /*
- * If cgroup events exist on this CPU, then we need to check if we have
- * to switch in PMU state; cgroup event are system-wide mode only.
- *
- * Since cgroup events are CPU events, we must schedule these in before
- * we schedule in the task events.
- */
- if (atomic_read(this_cpu_ptr(&perf_cgroup_events)))
- perf_cgroup_sched_in(prev, task);
-
for_each_task_context_nr(ctxn) {
ctx = task->perf_event_ctxp[ctxn];
if (likely(!ctx))
@@ -13556,7 +13474,7 @@ static int __perf_cgroup_move(void *info)
{
struct task_struct *task = info;
rcu_read_lock();
- perf_cgroup_switch(task, PERF_CGROUP_SWOUT | PERF_CGROUP_SWIN);
+ perf_cgroup_switch(task);
rcu_read_unlock();
return 0;
}
The patch above also does not apply to the 5.16-stable or 5.17-stable trees.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 4a9c7bbe2ed4d2b240674b1fb606c41d3940c412 Mon Sep 17 00:00:00 2001
From: Martin KaFai Lau <kafai(a)fb.com>
Date: Tue, 29 Mar 2022 18:14:56 -0700
Subject: [PATCH] bpf: Resolve to prog->aux->dst_prog->type only for
BPF_PROG_TYPE_EXT
Commit 7e40781cc8b7 ("bpf: verifier: Use target program's type for access verifications")
fixed the verifier checks for BPF_PROG_TYPE_EXT (extension)
progs so that the verifier checks against the type of the target
prog being extended instead of against BPF_PROG_TYPE_EXT itself.
The current resolve_prog_type() returns the target prog type
whenever prog->aux->dst_prog is non-NULL. However, when a
BPF_PROG_TYPE_TRACING prog is loaded to trace another bpf prog
instead of a kernel function, prog->aux->dst_prog is also non-NULL.
In this case, the verifier should still verify it as the
BPF_PROG_TYPE_TRACING type instead of as the traced prog's type in
prog->aux->dst_prog->type.
An oops has been reported when tracing a struct_ops prog. A NULL
dereference happened in check_return_code() when accessing
prog->aux->attach_func_proto->type; prog->aux->attach_func_proto
is NULL here because the traced struct_ops prog has "unreliable" set.
This patch changes resolve_prog_type() to return the target prog
type only if the prog being verified is BPF_PROG_TYPE_EXT.
Fixes: 7e40781cc8b7 ("bpf: verifier: Use target program's type for access verifications")
Signed-off-by: Martin KaFai Lau <kafai(a)fb.com>
Signed-off-by: Alexei Starovoitov <ast(a)kernel.org>
Acked-by: Yonghong Song <yhs(a)fb.com>
Link: https://lore.kernel.org/bpf/20220330011456.2984509-1-kafai@fb.com
diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index c1fc4af47f69..3a9d2d7cc6b7 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -570,9 +570,11 @@ static inline u32 type_flag(u32 type)
return type & ~BPF_BASE_TYPE_MASK;
}
+/* only use after check_attach_btf_id() */
static inline enum bpf_prog_type resolve_prog_type(struct bpf_prog *prog)
{
- return prog->aux->dst_prog ? prog->aux->dst_prog->type : prog->type;
+ return prog->type == BPF_PROG_TYPE_EXT ?
+ prog->aux->dst_prog->type : prog->type;
}
#endif /* _LINUX_BPF_VERIFIER_H */
The patch above also does not apply to the 5.15-stable, 5.16-stable, or
5.17-stable trees.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 1448769c9cdb69ad65287f4f7ab58bc5f2f5d7ba Mon Sep 17 00:00:00 2001
From: Jann Horn <jannh(a)google.com>
Date: Tue, 5 Apr 2022 18:39:31 +0200
Subject: [PATCH] random: check for signal_pending() outside of need_resched()
check
signal_pending() checks TIF_NOTIFY_SIGNAL and TIF_SIGPENDING, which
signal that the task should bail out of the syscall when possible. This
is a separate concept from need_resched(), which checks
TIF_NEED_RESCHED, signaling that the task should preempt.
In particular, with the current code, the signal_pending() bailout
probably won't work reliably.
Change this to look like other functions that read lots of data, such as
read_zero().
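For context, the loop shape being referred to looks roughly like the sketch
below (a hedged illustration of a read_zero()-style reader, not the literal
drivers/char/mem.c or drivers/char/random.c code; the chunk-production step
is elided into a comment):
	while (nbytes) {
		size_t chunk = min_t(size_t, nbytes, PAGE_SIZE);

		/* produce or copy one chunk into the user buffer here */
		ret += chunk;
		nbytes -= chunk;

		/* bail out on signals first, then merely offer to
		 * reschedule instead of calling schedule() directly */
		if (signal_pending(current)) {
			if (!ret)
				ret = -ERESTARTSYS;
			break;
		}
		cond_resched();
	}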
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Jann Horn <jannh(a)google.com>
Signed-off-by: Jason A. Donenfeld <Jason(a)zx2c4.com>
diff --git a/drivers/char/random.c b/drivers/char/random.c
index 47f01b1482a9..394cbd814a0b 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -549,13 +549,13 @@ static ssize_t get_random_bytes_user(void __user *buf, size_t nbytes)
}
do {
- if (large_request && need_resched()) {
+ if (large_request) {
if (signal_pending(current)) {
if (!ret)
ret = -ERESTARTSYS;
break;
}
- schedule();
+ cond_resched();
}
chacha20_block(chacha_state, output);
The patch above also does not apply to the 4.14-stable, 4.19-stable,
5.4-stable, 5.10-stable, 5.15-stable, 5.16-stable, or 5.17-stable trees.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From a431dbbc540532b7465eae4fc8b56a85a9fc7d17 Mon Sep 17 00:00:00 2001
From: Waiman Long <longman(a)redhat.com>
Date: Fri, 8 Apr 2022 13:09:01 -0700
Subject: [PATCH] mm/sparsemem: fix 'mem_section' will never be NULL gcc 12
warning
The gcc 12 compiler reports a "'mem_section' will never be NULL" warning
on the following code:
static inline struct mem_section *__nr_to_section(unsigned long nr)
{
#ifdef CONFIG_SPARSEMEM_EXTREME
if (!mem_section)
return NULL;
#endif
if (!mem_section[SECTION_NR_TO_ROOT(nr)])
return NULL;
:
It happens with CONFIG_SPARSEMEM_EXTREME off. The mem_section definition
is
#ifdef CONFIG_SPARSEMEM_EXTREME
extern struct mem_section **mem_section;
#else
extern struct mem_section mem_section[NR_SECTION_ROOTS][SECTIONS_PER_ROOT];
#endif
In the !CONFIG_SPARSEMEM_EXTREME case, mem_section is a static
2-dimensional array and so the check "!mem_section[SECTION_NR_TO_ROOT(nr)]"
doesn't make sense.
Fix this warning by moving the "!mem_section[SECTION_NR_TO_ROOT(nr)]"
check up inside the CONFIG_SPARSEMEM_EXTREME block and adding an
explicit NR_SECTION_ROOTS check to make sure that there is no
out-of-bounds array access.
Link: https://lkml.kernel.org/r/20220331180246.2746210-1-longman@redhat.com
Fixes: 3e347261a80b ("sparsemem extreme implementation")
Signed-off-by: Waiman Long <longman(a)redhat.com>
Reported-by: Justin Forbes <jforbes(a)redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov(a)linux.intel.com>
Cc: Ingo Molnar <mingo(a)kernel.org>
Cc: Rafael Aquini <aquini(a)redhat.com>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 962b14d403e8..46ffab808f03 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1397,13 +1397,16 @@ static inline unsigned long *section_to_usemap(struct mem_section *ms)
static inline struct mem_section *__nr_to_section(unsigned long nr)
{
+ unsigned long root = SECTION_NR_TO_ROOT(nr);
+
+ if (unlikely(root >= NR_SECTION_ROOTS))
+ return NULL;
+
#ifdef CONFIG_SPARSEMEM_EXTREME
- if (!mem_section)
+ if (!mem_section || !mem_section[root])
return NULL;
#endif
- if (!mem_section[SECTION_NR_TO_ROOT(nr)])
- return NULL;
- return &mem_section[SECTION_NR_TO_ROOT(nr)][nr & SECTION_ROOT_MASK];
+ return &mem_section[root][nr & SECTION_ROOT_MASK];
}
extern size_t mem_section_usage_size(void);
Hi Greg and Sasha,
Please find attached backports for commits 4013e26670c5 ("arm64: module:
remove (NOLOAD) from linker script") and 60210a3d86dc ("riscv module:
remove (NOLOAD)") to 4.9 through 5.10, where applicable. They resolve a
spew of warnings when linking modules with ld.lld. 4013e26670c5 is
currently queued from 5.15 to 5.17 and 60210a3d86dc is queued for 5.10
through 5.17, as that is where they applied cleanly.
Cheers,
Nathan
v2:
- Added patch to disable TSX development mode (Andrew, Boris)
- Rebased to v5.17-rc7
v1: https://lore.kernel.org/lkml/5bd785a1d6ea0b572250add0c6617b4504bc24d1.16444…
Hi,
After a recent microcode update, some Intel processors will always abort
Transactional Synchronization Extensions (TSX) transactions [*]. On
these processors, a new CPUID bit,
CPUID.07H.0H.EDX[11] (RTM_ALWAYS_ABORT), will be enumerated to indicate
that the loaded microcode is forcing Restricted Transactional Memory
(RTM) transactions to abort. If the processor previously enumerated
support for RTM, the CPUID enumeration bits for TSX (CPUID.RTM and
CPUID.HLE) continue to be set by default after the microcode update.
The first patch in this series clears CPUID.RTM and CPUID.HLE so that
userspace doesn't enumerate the TSX feature.
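As a rough sketch of the mechanism named in the first patch title (not the
patch itself; error handling and the surrounding arch/x86/kernel/cpu/tsx.c
plumbing are omitted, and the actual flow differs), the TSX CPUID bits can
be cleared through MSR_IA32_TSX_CTRL:
	/* Illustrative only: disable RTM and ask the microcode to hide
	 * the TSX CPUID bits (CPUID.RTM and CPUID.HLE) from enumeration. */
	u64 tsx_ctrl;

	rdmsrl(MSR_IA32_TSX_CTRL, tsx_ctrl);
	tsx_ctrl |= TSX_CTRL_RTM_DISABLE | TSX_CTRL_CPUID_CLEAR;
	wrmsrl(MSR_IA32_TSX_CTRL, tsx_ctrl);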
The microcode also added support to re-enable TSX for development purposes.
Doing so is not recommended for production deployments, because the MD_CLEAR
flows for the mitigation of TSX Asynchronous Abort (TAA) may not be
effective on these processors when TSX is re-enabled with the updated
microcode.
The second patch unconditionally disables this TSX development mode in case
it was enabled by software running before kernel boot.
Thanks,
Pawan
[*] Intel Transactional Synchronization Extension (Intel TSX) Disable Update for Selected Processors
https://cdrdv2.intel.com/v1/dl/getContent/643557
Pawan Gupta (2):
x86/tsx: Use MSR_TSX_CTRL to clear CPUID bits
x86/tsx: Disable TSX development mode at boot
arch/x86/include/asm/msr-index.h | 4 +-
arch/x86/kernel/cpu/cpu.h | 1 +
arch/x86/kernel/cpu/intel.c | 5 ++
arch/x86/kernel/cpu/tsx.c | 88 ++++++++++++++++++++++++--
tools/arch/x86/include/asm/msr-index.h | 4 +-
5 files changed, 91 insertions(+), 11 deletions(-)
base-commit: ffb217a13a2eaf6d5bd974fc83036a53ca69f1e2
--
2.25.1
There is a bug in the aoe driver module where every forcible removal of
an aoe device (e.g. "rmmod aoe" with aoe devices available, or "aoe-flush
ex.x") leads to a page fault.
The code in freedev() calls blk_mq_free_tag_set() before running
blk_cleanup_queue(), which leads to this issue
(drivers/block/aoe/aoedev.c L281ff).
This issue was fixed upstream in commit 6560ec9 (aoe: use
blk_mq_alloc_disk and blk_cleanup_disk) with the introduction and use of
the function blk_cleanup_disk().
This patch applies to kernels 5.4 and 5.10.
The function calls are reordered to match the behavior of
blk_cleanup_disk() and thereby mitigate the issue.
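For reference, the upstream helper this ordering mirrors is roughly the
following (a from-memory sketch of blk_cleanup_disk() as introduced by
6560ec9, not a quoted copy; the exact body varies between kernel versions):
void blk_cleanup_disk(struct gendisk *disk)
{
	blk_cleanup_queue(disk->queue);
	put_disk(disk);
}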
Fixes: 3582dd2 (aoe: convert aoeblk to blk-mq)
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=215647
Signed-off-by: Valentin Kleibel <valentin(a)vrvis.at>
---
drivers/block/aoe/aoedev.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/block/aoe/aoedev.c b/drivers/block/aoe/aoedev.c
index e2ea2356da06..08c98ea724ea 100644
--- a/drivers/block/aoe/aoedev.c
+++ b/drivers/block/aoe/aoedev.c
@@ -277,9 +277,9 @@ freedev(struct aoedev *d)
if (d->gd) {
aoedisk_rm_debugfs(d);
del_gendisk(d->gd);
+ blk_cleanup_queue(d->blkq);
put_disk(d->gd);
blk_mq_free_tag_set(&d->tag_set);
- blk_cleanup_queue(d->blkq);
}
t = d->targets;
e = t + d->ntargets;
CVE-2021-4197 patchset consists of:
[1] 1756d7994ad8 ("cgroup: Use open-time credentials for process migraton perm checks")
[2] 0d2b5955b362 ("cgroup: Allocate cgroup_file_ctx for kernfs_open_file->priv")
[3] e57457641613 ("cgroup: Use open-time cgroup namespace for process migration perm checks")
[4] b09c2baa5634 ("selftests: cgroup: Make cg_create() use 0755 for permission instead of 0644")
[5] 613e040e4dc2 ("selftests: cgroup: Test open-time credential usage for migration checks")
[6] bf35a7879f1d ("selftests: cgroup: Test open-time cgroup namespace usage for migration checks")
Commits [2] and [3] are already present in 5.10-stable; this patchset
includes backports for the other commits.
Backport summary
----------------
1756d7994ad8 ("cgroup: Use open-time credentials for process migraton perm checks")
* Refactoring commit da70862efe006 ("cgroup: cgroup.{procs,threads}
factor out common parts") is not present in kernel versions < 5.12,
so the original changes to __cgroup_procs_write() had to be applied
in both cgroup_threads_write() and cgroup_procs_write() functions.
c2e46f6b3e35 ("selftests/cgroup: Fix build on older distros")
* This extra commit was added to fix the following selftest build
failure; it applies cleanly:
...
cgroup_util.c: In function ‘clone_into_cgroup’:
cgroup_util.c:343:4: error: ‘struct clone_args’ has no member named ‘cgroup’
343 | .cgroup = cgroup_fd,
| ^~~~~~
All other selftest changes are clean cherry-picks.
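For context, the failure quoted above comes from older uapi/glibc headers
that lack the "cgroup" member in struct clone_args. A self-contained
illustration of the usual workaround (carrying a local clone3() argument
struct and invoking the raw syscall) is shown below; the struct name and
wrapper are hypothetical and this is not the code from c2e46f6b3e35:
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Local copy of the clone3() argument layout so the code builds even
 * when the system headers predate the "cgroup" member. */
struct clone_args_v2 {
	uint64_t flags;
	uint64_t pidfd;
	uint64_t child_tid;
	uint64_t parent_tid;
	uint64_t exit_signal;
	uint64_t stack;
	uint64_t stack_size;
	uint64_t tls;
	uint64_t set_tid;
	uint64_t set_tid_size;
	uint64_t cgroup;
};

#ifndef CLONE_INTO_CGROUP
#define CLONE_INTO_CGROUP 0x200000000ULL
#endif

static pid_t clone_into_cgroup_fd(int cgroup_fd)
{
	struct clone_args_v2 args = {
		.flags  = CLONE_INTO_CGROUP,
		.cgroup = (uint64_t)cgroup_fd,
	};

	return (pid_t)syscall(__NR_clone3, &args, sizeof(args));
}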
Testing
-------
The newly introduced selftests (test_cgcore_lesser_euid_open() and
test_cgcore_lesser_ns_open()) pass with this series applied:
root@intel-x86-64:~# ./test_core
ok 1 test_cgcore_internal_process_constraint
ok 2 test_cgcore_top_down_constraint_enable
ok 3 test_cgcore_top_down_constraint_disable
ok 4 test_cgcore_no_internal_process_constraint_os
ok 5 test_cgcore_parent_becomes_threaded
ok 6 test_cgcore_invalid_domain
ok 7 test_cgcore_populated
ok 8 test_cgcore_proc_migration
ok 9 test_cgcore_thread_migration
ok 10 test_cgcore_destroy
ok 11 test_cgcore_lesser_euid_open
ok 12 test_cgcore_lesser_ns_open
Sachin Sant (1):
selftests/cgroup: Fix build on older distros
Tejun Heo (4):
cgroup: Use open-time credentials for process migraton perm checks
selftests: cgroup: Make cg_create() use 0755 for permission instead of
0644
selftests: cgroup: Test open-time credential usage for migration
checks
selftests: cgroup: Test open-time cgroup namespace usage for migration
checks
kernel/cgroup/cgroup-v1.c | 7 +-
kernel/cgroup/cgroup.c | 17 +-
tools/testing/selftests/cgroup/cgroup_util.c | 6 +-
tools/testing/selftests/cgroup/test_core.c | 165 +++++++++++++++++++
4 files changed, 188 insertions(+), 7 deletions(-)
--
2.25.1
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 5abfd71d936a8aefd9f9ccd299dea7a164a5d455 Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx(a)redhat.com>
Date: Tue, 22 Mar 2022 14:42:15 -0700
Subject: [PATCH] mm: don't skip swap entry even if zap_details specified
Patch series "mm: Rework zap ptes on swap entries", v5.
Patch 1 should fix a long-standing bug in zap_pte_range()'s use of
zap_details. The risk is that some swap entries could be skipped
while they should have been zapped.
Migration entries are not the major concern because file-backed memory is
always zapped in the pattern "first time without page lock, then re-zap
with page lock", hence the 2nd zap will always make sure all migration
entries are already recovered.
However, genuine swap entries can be skipped erroneously. A reproducer is
provided in the commit message of patch 1.
Patches 2-4 are cleanups based on patch 1. After the whole
patchset is applied, we should have a very clean view of zap_pte_range().
Only patch 1 needs to be backported to stable if necessary.
This patch (of 4):
The "details" pointer shouldn't be the token to decide whether we should
skip swap entries.
For example, when the caller specifies details->zap_mapping==NULL, it
means the user wants to zap all the pages (including COWed pages), then
we need to look into swap entries because there can be private COWed
pages that were swapped out.
Skipping some swap entries when details is non-NULL may wrongly leave
behind swap entries that should have been zapped.
A reproducer of the problem:
===8<===
#define _GNU_SOURCE /* See feature_test_macros(7) */
#include <stdio.h>
#include <assert.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/types.h>
int page_size;
int shmem_fd;
char *buffer;
void main(void)
{
        int ret;
        char val;

        page_size = getpagesize();
        shmem_fd = memfd_create("test", 0);
        assert(shmem_fd >= 0);

        ret = ftruncate(shmem_fd, page_size * 2);
        assert(ret == 0);

        buffer = mmap(NULL, page_size * 2, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE, shmem_fd, 0);
        assert(buffer != MAP_FAILED);

        /* Write private page, swap it out */
        buffer[page_size] = 1;
        madvise(buffer, page_size * 2, MADV_PAGEOUT);

        /* This should drop private buffer[page_size] already */
        ret = ftruncate(shmem_fd, page_size);
        assert(ret == 0);

        /* Recover the size */
        ret = ftruncate(shmem_fd, page_size * 2);
        assert(ret == 0);

        /* Re-read the data, it should be all zero */
        val = buffer[page_size];
        if (val == 0)
                printf("Good\n");
        else
                printf("BUG\n");
}
===8<===
We don't need to touch up the pmd path, because pmd never had an issue
with swap entries. For example, shmem pmd migration is always split down
to pte level, and the same applies to swapping of anonymous memory.
Add another helper should_zap_cows() so that we can also check whether we
should zap private mappings when there's no page pointer specified.
This patch drops that trick, so we handle swap ptes coherently. Meanwhile
we should do the same check upon migration entry, hwpoison entry and
genuine swap entries too.
To be explicit, we should still remember to keep the private entries if
even_cows==false, and always zap them when even_cows==true.
The issue seems to exist starting from the initial commit of git.
[peterx(a)redhat.com: comment tweaks]
Link: https://lkml.kernel.org/r/20220217060746.71256-2-peterx@redhat.com
Link: https://lkml.kernel.org/r/20220217060746.71256-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20220216094810.60572-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20220216094810.60572-2-peterx@redhat.com
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Peter Xu <peterx(a)redhat.com>
Reviewed-by: John Hubbard <jhubbard(a)nvidia.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Alistair Popple <apopple(a)nvidia.com>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: "Kirill A . Shutemov" <kirill(a)shutemov.name>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Yang Shi <shy828301(a)gmail.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
diff --git a/mm/memory.c b/mm/memory.c
index 30e6d0248a3d..a7bf87cf2ba0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1313,6 +1313,17 @@ struct zap_details {
struct folio *single_folio; /* Locked folio to be unmapped */
};
+/* Whether we should zap all COWed (private) pages too */
+static inline bool should_zap_cows(struct zap_details *details)
+{
+ /* By default, zap all pages */
+ if (!details)
+ return true;
+
+ /* Or, we zap COWed pages only if the caller wants to */
+ return !details->zap_mapping;
+}
+
/*
* We set details->zap_mapping when we want to unmap shared but keep private
* pages. Return true if skip zapping this page, false otherwise.
@@ -1320,11 +1331,15 @@ struct zap_details {
static inline bool
zap_skip_check_mapping(struct zap_details *details, struct page *page)
{
- if (!details || !page)
+ /* If we can make a decision without *page.. */
+ if (should_zap_cows(details))
+ return false;
+
+ /* E.g. the caller passes NULL for the case of a zero page */
+ if (!page)
return false;
- return details->zap_mapping &&
- (details->zap_mapping != page_rmapping(page));
+ return details->zap_mapping != page_rmapping(page);
}
static unsigned long zap_pte_range(struct mmu_gather *tlb,
@@ -1405,17 +1420,24 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
continue;
}
- /* If details->check_mapping, we leave swap entries. */
- if (unlikely(details))
- continue;
-
- if (!non_swap_entry(entry))
+ if (!non_swap_entry(entry)) {
+ /* Genuine swap entry, hence a private anon page */
+ if (!should_zap_cows(details))
+ continue;
rss[MM_SWAPENTS]--;
- else if (is_migration_entry(entry)) {
+ } else if (is_migration_entry(entry)) {
struct page *page;
page = pfn_swap_entry_to_page(entry);
+ if (zap_skip_check_mapping(details, page))
+ continue;
rss[mm_counter(page)]--;
+ } else if (is_hwpoison_entry(entry)) {
+ if (!should_zap_cows(details))
+ continue;
+ } else {
+ /* We should have covered all the swap entry types */
+ WARN_ON_ONCE(1);
}
if (unlikely(!free_swap_and_cache(entry)))
print_bad_pte(vma, addr, ptent, NULL);
CVE-2021-4197 patchset consists of:
[1] 1756d7994ad8 ("cgroup: Use open-time credentials for process migraton perm checks")
[2] 0d2b5955b362 ("cgroup: Allocate cgroup_file_ctx for kernfs_open_file->priv")
[3] e57457641613 ("cgroup: Use open-time cgroup namespace for process migration perm checks")
[4] b09c2baa5634 ("selftests: cgroup: Make cg_create() use 0755 for permission instead of 0644")
[5] 613e040e4dc2 ("selftests: cgroup: Test open-time credential usage for migration checks")
[6] bf35a7879f1d ("selftests: cgroup: Test open-time cgroup namespace usage for migration checks")
Commits [1], [2] and [3] are already present in 5.15-stable; this patchset
includes backports for the selftests. All patches are clean cherry-picks.
The newly introduced selftests (test_cgcore_lesser_euid_open() and
test_cgcore_lesser_ns_open()) pass with this series applied:
root@intel-x86-64:~# ./test_core
ok 1 test_cgcore_internal_process_constraint
ok 2 test_cgcore_top_down_constraint_enable
ok 3 test_cgcore_top_down_constraint_disable
ok 4 test_cgcore_no_internal_process_constraint_os
ok 5 test_cgcore_parent_becomes_threaded
ok 6 test_cgcore_invalid_domain
ok 7 test_cgcore_populated
ok 8 test_cgcore_proc_migration
ok 9 test_cgcore_thread_migration
ok 10 test_cgcore_destroy
ok 11 test_cgcore_lesser_euid_open
ok 12 test_cgcore_lesser_ns_open
Tejun Heo (3):
selftests: cgroup: Make cg_create() use 0755 for permission instead of
0644
selftests: cgroup: Test open-time credential usage for migration
checks
selftests: cgroup: Test open-time cgroup namespace usage for migration
checks
tools/testing/selftests/cgroup/cgroup_util.c | 2 +-
tools/testing/selftests/cgroup/test_core.c | 165 +++++++++++++++++++
2 files changed, 166 insertions(+), 1 deletion(-)
--
2.25.1
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 0d319dd5a27183b75d984e3dc495248e59f99334 Mon Sep 17 00:00:00 2001
From: Yann Gautier <yann.gautier(a)foss.st.com>
Date: Thu, 17 Mar 2022 12:19:43 +0100
Subject: [PATCH] mmc: mmci: stm32: correctly check all elements of sg list
Use sg and not data->sg when checking sg list elements. Else only the
first element's alignment is checked.
The last element should be checked the same way; for_each_sg has already
set sg to sg_next(sg).
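As an illustration only (not from the patch), here is a userspace analog of
this bug class, using made-up names: the loop advances a cursor, but the buggy
check kept inspecting the first element, so only element 0 was ever validated.
The corrected form checks the cursor, just as the hunks below do with sg.
===8<===
#include <stdio.h>

struct seg { unsigned int offset; unsigned int length; };

static int validate(const struct seg *segs, int n)
{
	for (int i = 0; i < n; i++) {
		const struct seg *s = &segs[i];

		/* Corrected form: check the cursor s; the buggy form checked
		 * segs[0] on every iteration (cf. data->sg vs sg). */
		if (s->offset % 4 || s->length % 8) {
			printf("unaligned segment %d: ofst:%u length:%u\n",
			       i, s->offset, s->length);
			return -1;
		}
	}
	return 0;
}

int main(void)
{
	struct seg segs[] = { { 0, 8 }, { 2, 8 } }; /* second one is unaligned */

	return validate(segs, 2) ? 1 : 0;
}
===8<===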
Fixes: 46b723dd867d ("mmc: mmci: add stm32 sdmmc variant")
Cc: stable(a)vger.kernel.org
Signed-off-by: Yann Gautier <yann.gautier(a)foss.st.com>
Link: https://lore.kernel.org/r/20220317111944.116148-2-yann.gautier@foss.st.com
Signed-off-by: Ulf Hansson <ulf.hansson(a)linaro.org>
diff --git a/drivers/mmc/host/mmci_stm32_sdmmc.c b/drivers/mmc/host/mmci_stm32_sdmmc.c
index 9c13f2c31365..4566d7fc9055 100644
--- a/drivers/mmc/host/mmci_stm32_sdmmc.c
+++ b/drivers/mmc/host/mmci_stm32_sdmmc.c
@@ -62,8 +62,8 @@ static int sdmmc_idma_validate_data(struct mmci_host *host,
* excepted the last element which has no constraint on idmasize
*/
for_each_sg(data->sg, sg, data->sg_len - 1, i) {
- if (!IS_ALIGNED(data->sg->offset, sizeof(u32)) ||
- !IS_ALIGNED(data->sg->length, SDMMC_IDMA_BURST)) {
+ if (!IS_ALIGNED(sg->offset, sizeof(u32)) ||
+ !IS_ALIGNED(sg->length, SDMMC_IDMA_BURST)) {
dev_err(mmc_dev(host->mmc),
"unaligned scatterlist: ofst:%x length:%d\n",
data->sg->offset, data->sg->length);
@@ -71,7 +71,7 @@ static int sdmmc_idma_validate_data(struct mmci_host *host,
}
}
- if (!IS_ALIGNED(data->sg->offset, sizeof(u32))) {
+ if (!IS_ALIGNED(sg->offset, sizeof(u32))) {
dev_err(mmc_dev(host->mmc),
"unaligned last scatterlist: ofst:%x length:%d\n",
data->sg->offset, data->sg->length);
The patch below does not apply to the 5.16-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 063452fd94d153d4eb38ad58f210f3d37a09cca4 Mon Sep 17 00:00:00 2001
From: Yang Zhong <yang.zhong(a)intel.com>
Date: Sat, 29 Jan 2022 09:36:46 -0800
Subject: [PATCH] x86/fpu/xstate: Fix the ARCH_REQ_XCOMP_PERM implementation
ARCH_REQ_XCOMP_PERM is supposed to add the requested feature to the
permission bitmap of thread_group_leader()->fpu. But the code overwrites
the bitmap with only the requested feature bit rather than adding it.
Fix the code to add the requested feature bit to the master bitmask.
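To make the one-line change below concrete, here is a minimal userspace sketch
(illustrative values, not the kernel's actual xstate code) of the difference
between overwriting the permission word and OR-ing the new bit into it:
===8<===
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t permitted = 0x3;        /* features already granted */
	uint64_t requested = 1ull << 18; /* newly requested feature bit */

	uint64_t broken = requested;             /* old code: drops prior grants */
	uint64_t fixed  = permitted | requested; /* fixed code: adds the new bit */

	printf("broken=0x%llx fixed=0x%llx\n",
	       (unsigned long long)broken, (unsigned long long)fixed);
	return 0;
}
===8<===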
Fixes: db8268df0983 ("x86/arch_prctl: Add controls for dynamic XSTATE components")
Signed-off-by: Yang Zhong <yang.zhong(a)intel.com>
Signed-off-by: Chang S. Bae <chang.seok.bae(a)intel.com>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Paolo Bonzini <bonzini(a)gnu.org>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/20220129173647.27981-2-chang.seok.bae@intel.com
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 7c7824ae7862..dc6d5e98d296 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1639,7 +1639,7 @@ static int __xstate_request_perm(u64 permitted, u64 requested, bool guest)
perm = guest ? &fpu->guest_perm : &fpu->perm;
/* Pairs with the READ_ONCE() in xstate_get_group_perm() */
- WRITE_ONCE(perm->__state_perm, requested);
+ WRITE_ONCE(perm->__state_perm, mask);
/* Protected by sighand lock */
perm->__state_size = ksize;
perm->__user_state_size = usize;
From: Kees Cook <keescook(a)chromium.org>
Upstream commit: 69d0db01e210 ("ubsan: remove CONFIG_UBSAN_OBJECT_SIZE")
The object-size sanitizer is redundant to -Warray-bounds, and
inappropriately performs its checks at run-time when all information
needed for the evaluation is available at compile-time, making it quite
difficult to use:
https://bugzilla.kernel.org/show_bug.cgi?id=214861
These run-time object-size checks also trigger false-positive errors,
like the one below, which make it quite difficult to test stable kernels
in test automation like syzkaller:
https://syzkaller.appspot.com/text?tag=Error&x=12b3aac3700000
With -Warray-bounds enabled almost globally, it doesn't make sense to
keep this around.
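For context only (not part of the patch), a minimal example of the bug class
that -Warray-bounds already diagnoses at compile time, which is the rationale
for dropping the run-time object-size checks. Build with something like
gcc -O2 -Warray-bounds:
===8<===
int a[4];

int read_past_end(void)
{
	return a[10];	/* -Warray-bounds: subscript 10 is above array bounds */
}

int main(void)
{
	return 0;	/* the point is the compile-time diagnostic, not runtime behaviour */
}
===8<===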
Link: https://lkml.kernel.org/r/20211203235346.110809-1-keescook@chromium.org
Signed-off-by: Kees Cook <keescook(a)chromium.org>
Reviewed-by: Marco Elver <elver(a)google.com>
Cc: Masahiro Yamada <masahiroy(a)kernel.org>
Cc: Michal Marek <michal.lkml(a)markovi.net>
Cc: Nick Desaulniers <ndesaulniers(a)google.com>
Cc: Nathan Chancellor <nathan(a)kernel.org>
Cc: Andrey Ryabinin <ryabinin.a.a(a)gmail.com>
Cc: "Peter Zijlstra (Intel)" <peterz(a)infradead.org>
Cc: Stephen Rothwell <sfr(a)canb.auug.org.au>
Cc: Arnd Bergmann <arnd(a)arndb.de>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Tadeusz Struk <tadeusz.struk(a)linaro.org>
---
lib/test_ubsan.c | 11 -----------
scripts/Makefile.ubsan | 1 -
2 files changed, 12 deletions(-)
diff --git a/lib/test_ubsan.c b/lib/test_ubsan.c
index 9ea10adf7a66..b1d0a6ecfe1b 100644
--- a/lib/test_ubsan.c
+++ b/lib/test_ubsan.c
@@ -89,16 +89,6 @@ static void test_ubsan_misaligned_access(void)
*ptr = val;
}
-static void test_ubsan_object_size_mismatch(void)
-{
- /* "((aligned(8)))" helps this not into be misaligned for ptr-access. */
- volatile int val __aligned(8) = 4;
- volatile long long *ptr, val2;
-
- ptr = (long long *)&val;
- val2 = *ptr;
-}
-
static const test_ubsan_fp test_ubsan_array[] = {
test_ubsan_add_overflow,
test_ubsan_sub_overflow,
@@ -110,7 +100,6 @@ static const test_ubsan_fp test_ubsan_array[] = {
test_ubsan_load_invalid_value,
//test_ubsan_null_ptr_deref, /* exclude it because there is a crash */
test_ubsan_misaligned_access,
- test_ubsan_object_size_mismatch,
};
static int __init test_ubsan_init(void)
diff --git a/scripts/Makefile.ubsan b/scripts/Makefile.ubsan
index 9716dab06bc7..2156e18391a3 100644
--- a/scripts/Makefile.ubsan
+++ b/scripts/Makefile.ubsan
@@ -23,7 +23,6 @@ ifdef CONFIG_UBSAN_MISC
CFLAGS_UBSAN += $(call cc-option, -fsanitize=integer-divide-by-zero)
CFLAGS_UBSAN += $(call cc-option, -fsanitize=unreachable)
CFLAGS_UBSAN += $(call cc-option, -fsanitize=signed-integer-overflow)
- CFLAGS_UBSAN += $(call cc-option, -fsanitize=object-size)
CFLAGS_UBSAN += $(call cc-option, -fsanitize=bool)
CFLAGS_UBSAN += $(call cc-option, -fsanitize=enum)
endif
--
2.35.1
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 058ec4a7d9cf77238c73ad9f1e1a3ed9a29afcab Mon Sep 17 00:00:00 2001
From: Jakub Sitnicki <jakub(a)cloudflare.com>
Date: Sat, 19 Mar 2022 19:33:54 +0100
Subject: [PATCH] bpf: Treat bpf_sk_lookup remote_port as a 2-byte field
In commit 9a69e2b385f4 ("bpf: Make remote_port field in struct
bpf_sk_lookup 16-bit wide") the remote_port field has been split up and
re-declared from u32 to be16.
However, the accompanying changes to the context access converter have not
been well thought through when it comes to big-endian platforms.
Today 2-byte wide loads from offsetof(struct bpf_sk_lookup, remote_port)
are handled as narrow loads from a 4-byte wide field.
This by itself is not enough to create a problem, but when we combine
1. 32-bit wide access to ->remote_port backed by a 16-bit wide load, with
2. the inherent difference between little- and big-endian in how narrow loads
have to be handled (see bpf_ctx_narrow_access_offset),
we get inconsistent results for 2-byte loads from &ctx->remote_port on LE
and BE architectures. This in turn makes BPF C code for the common case of
a 2-byte load from ctx->remote_port not portable.
To rectify it, inform the context access converter that remote_port is a
2-byte wide field, and only 1-byte loads need to be treated as narrow
loads.
At the same time, we special-case the 4-byte load from &ctx->remote_port to
continue handling it the same way as we do today, in order to keep the
existing BPF programs working.
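For illustration only (not taken from the patch), a sketch of the two access
patterns an sk_lookup program might use; it assumes the usual libbpf headers
and the post-fix __be16 layout of remote_port:
===8<===
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("sk_lookup")
int check_remote_port(struct bpf_sk_lookup *ctx)
{
	/* Portable pattern: a plain 2-byte load of the __be16 field. */
	__u16 rport = bpf_ntohs(ctx->remote_port);

	/* Old, surprising pattern from when remote_port was a __u32 field:
	 * bpf_ntohl(ctx->remote_port), whose result is shifted by 16 bits. */

	return rport == 53 ? SK_DROP : SK_PASS;
}

char _license[] SEC("license") = "GPL";
===8<===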
Fixes: 9a69e2b385f4 ("bpf: Make remote_port field in struct bpf_sk_lookup 16-bit wide")
Signed-off-by: Jakub Sitnicki <jakub(a)cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast(a)kernel.org>
Acked-by: Martin KaFai Lau <kafai(a)fb.com>
Link: https://lore.kernel.org/bpf/20220319183356.233666-2-jakub@cloudflare.com
diff --git a/net/core/filter.c b/net/core/filter.c
index 03655f2074ae..a7044e98765e 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -10989,13 +10989,24 @@ static bool sk_lookup_is_valid_access(int off, int size,
case bpf_ctx_range(struct bpf_sk_lookup, local_ip4):
case bpf_ctx_range_till(struct bpf_sk_lookup, remote_ip6[0], remote_ip6[3]):
case bpf_ctx_range_till(struct bpf_sk_lookup, local_ip6[0], local_ip6[3]):
- case offsetof(struct bpf_sk_lookup, remote_port) ...
- offsetof(struct bpf_sk_lookup, local_ip4) - 1:
case bpf_ctx_range(struct bpf_sk_lookup, local_port):
case bpf_ctx_range(struct bpf_sk_lookup, ingress_ifindex):
bpf_ctx_record_field_size(info, sizeof(__u32));
return bpf_ctx_narrow_access_ok(off, size, sizeof(__u32));
+ case bpf_ctx_range(struct bpf_sk_lookup, remote_port):
+ /* Allow 4-byte access to 2-byte field for backward compatibility */
+ if (size == sizeof(__u32))
+ return true;
+ bpf_ctx_record_field_size(info, sizeof(__be16));
+ return bpf_ctx_narrow_access_ok(off, size, sizeof(__be16));
+
+ case offsetofend(struct bpf_sk_lookup, remote_port) ...
+ offsetof(struct bpf_sk_lookup, local_ip4) - 1:
+ /* Allow access to zero padding for backward compatibility */
+ bpf_ctx_record_field_size(info, sizeof(__u16));
+ return bpf_ctx_narrow_access_ok(off, size, sizeof(__u16));
+
default:
return false;
}
@@ -11077,6 +11088,11 @@ static u32 sk_lookup_convert_ctx_access(enum bpf_access_type type,
sport, 2, target_size));
break;
+ case offsetofend(struct bpf_sk_lookup, remote_port):
+ *target_size = 2;
+ *insn++ = BPF_MOV32_IMM(si->dst_reg, 0);
+ break;
+
case offsetof(struct bpf_sk_lookup, local_port):
*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->src_reg,
bpf_target_off(struct bpf_sk_lookup_kern,
The patch below does not apply to the 5.16-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 058ec4a7d9cf77238c73ad9f1e1a3ed9a29afcab Mon Sep 17 00:00:00 2001
From: Jakub Sitnicki <jakub(a)cloudflare.com>
Date: Sat, 19 Mar 2022 19:33:54 +0100
Subject: [PATCH] bpf: Treat bpf_sk_lookup remote_port as a 2-byte field
In commit 9a69e2b385f4 ("bpf: Make remote_port field in struct
bpf_sk_lookup 16-bit wide") the remote_port field has been split up and
re-declared from u32 to be16.
However, the accompanying changes to the context access converter have not
been well thought through when it comes to big-endian platforms.
Today 2-byte wide loads from offsetof(struct bpf_sk_lookup, remote_port)
are handled as narrow loads from a 4-byte wide field.
This by itself is not enough to create a problem, but when we combine
1. 32-bit wide access to ->remote_port backed by a 16-bit wide load, with
2. the inherent difference between little- and big-endian in how narrow loads
have to be handled (see bpf_ctx_narrow_access_offset),
we get inconsistent results for 2-byte loads from &ctx->remote_port on LE
and BE architectures. This in turn makes BPF C code for the common case of
a 2-byte load from ctx->remote_port not portable.
To rectify it, inform the context access converter that remote_port is a
2-byte wide field, and only 1-byte loads need to be treated as narrow
loads.
At the same time, we special-case the 4-byte load from &ctx->remote_port to
continue handling it the same way as we do today, in order to keep the
existing BPF programs working.
Fixes: 9a69e2b385f4 ("bpf: Make remote_port field in struct bpf_sk_lookup 16-bit wide")
Signed-off-by: Jakub Sitnicki <jakub(a)cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast(a)kernel.org>
Acked-by: Martin KaFai Lau <kafai(a)fb.com>
Link: https://lore.kernel.org/bpf/20220319183356.233666-2-jakub@cloudflare.com
diff --git a/net/core/filter.c b/net/core/filter.c
index 03655f2074ae..a7044e98765e 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -10989,13 +10989,24 @@ static bool sk_lookup_is_valid_access(int off, int size,
case bpf_ctx_range(struct bpf_sk_lookup, local_ip4):
case bpf_ctx_range_till(struct bpf_sk_lookup, remote_ip6[0], remote_ip6[3]):
case bpf_ctx_range_till(struct bpf_sk_lookup, local_ip6[0], local_ip6[3]):
- case offsetof(struct bpf_sk_lookup, remote_port) ...
- offsetof(struct bpf_sk_lookup, local_ip4) - 1:
case bpf_ctx_range(struct bpf_sk_lookup, local_port):
case bpf_ctx_range(struct bpf_sk_lookup, ingress_ifindex):
bpf_ctx_record_field_size(info, sizeof(__u32));
return bpf_ctx_narrow_access_ok(off, size, sizeof(__u32));
+ case bpf_ctx_range(struct bpf_sk_lookup, remote_port):
+ /* Allow 4-byte access to 2-byte field for backward compatibility */
+ if (size == sizeof(__u32))
+ return true;
+ bpf_ctx_record_field_size(info, sizeof(__be16));
+ return bpf_ctx_narrow_access_ok(off, size, sizeof(__be16));
+
+ case offsetofend(struct bpf_sk_lookup, remote_port) ...
+ offsetof(struct bpf_sk_lookup, local_ip4) - 1:
+ /* Allow access to zero padding for backward compatibility */
+ bpf_ctx_record_field_size(info, sizeof(__u16));
+ return bpf_ctx_narrow_access_ok(off, size, sizeof(__u16));
+
default:
return false;
}
@@ -11077,6 +11088,11 @@ static u32 sk_lookup_convert_ctx_access(enum bpf_access_type type,
sport, 2, target_size));
break;
+ case offsetofend(struct bpf_sk_lookup, remote_port):
+ *target_size = 2;
+ *insn++ = BPF_MOV32_IMM(si->dst_reg, 0);
+ break;
+
case offsetof(struct bpf_sk_lookup, local_port):
*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->src_reg,
bpf_target_off(struct bpf_sk_lookup_kern,
kongweibin reported a kernel panic in ip6_forward() when the input interface
has no in6 dev associated.
The following tc commands were used to reproduce this panic:
tc qdisc del dev vxlan100 root
tc qdisc add dev vxlan100 root netem corrupt 5%
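As a side note (not part of the patch), a tiny userspace analog of the guard
being added, with made-up types: the pointer test has to short-circuit before
the dereference, which is exactly what the one-line change below does for idev.
===8<===
#include <stdio.h>
#include <stdbool.h>

struct cnf_stub { bool disable_policy; };
struct in6_dev_stub { struct cnf_stub cnf; };

static bool policy_check_needed(const struct in6_dev_stub *idev, bool all_disabled)
{
	/* Buggy form dereferenced idev unconditionally:
	 *   !all_disabled && !idev->cnf.disable_policy
	 * Fixed form short-circuits when idev is NULL: */
	return !all_disabled && (!idev || !idev->cnf.disable_policy);
}

int main(void)
{
	printf("%d\n", policy_check_needed(NULL, false)); /* no crash, prints 1 */
	return 0;
}
===8<===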
CC: stable(a)vger.kernel.org
Fixes: ccd27f05ae7b ("ipv6: fix 'disable_policy' for fwd packets")
Reported-by: kongweibin <kongweibin2(a)huawei.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel(a)6wind.com>
---
kongweibin, could you test this patch with your setup?
Thanks,
Nicolas
net/ipv6/ip6_output.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index e23f058166af..fa63ef2bd99c 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -485,7 +485,7 @@ int ip6_forward(struct sk_buff *skb)
goto drop;
if (!net->ipv6.devconf_all->disable_policy &&
- !idev->cnf.disable_policy &&
+ (!idev || !idev->cnf.disable_policy) &&
!xfrm6_policy_check(NULL, XFRM_POLICY_FWD, skb)) {
__IP6_INC_STATS(net, idev, IPSTATS_MIB_INDISCARDS);
goto drop;
--
2.33.0
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 9a69e2b385f443f244a7e8b8bcafe5ccfb0866b4 Mon Sep 17 00:00:00 2001
From: Jakub Sitnicki <jakub(a)cloudflare.com>
Date: Wed, 9 Feb 2022 19:43:32 +0100
Subject: [PATCH] bpf: Make remote_port field in struct bpf_sk_lookup 16-bit
wide
remote_port is another case of a BPF context field documented as a 32-bit
value in network byte order for which the BPF context access converter
generates a load of a zero-padded 16-bit integer in network byte order.
First such case was dst_port in bpf_sock which got addressed in commit
4421a582718a ("bpf: Make dst_port field in struct bpf_sock 16-bit wide").
Loading 4-bytes from the remote_port offset and converting the value with
bpf_ntohl() leads to surprising results, as the expected value is shifted
by 16 bits.
Reduce the confusion by splitting the field in two - a 16-bit field holding
a big-endian integer, and a 16-bit zero-padding anonymous field that
follows it.
Suggested-by: Alexei Starovoitov <ast(a)kernel.org>
Signed-off-by: Jakub Sitnicki <jakub(a)cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast(a)kernel.org>
Link: https://lore.kernel.org/bpf/20220209184333.654927-2-jakub@cloudflare.com
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index a7f0ddedac1f..afe3d0d7f5f2 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -6453,7 +6453,8 @@ struct bpf_sk_lookup {
__u32 protocol; /* IP protocol (IPPROTO_TCP, IPPROTO_UDP) */
__u32 remote_ip4; /* Network byte order */
__u32 remote_ip6[4]; /* Network byte order */
- __u32 remote_port; /* Network byte order */
+ __be16 remote_port; /* Network byte order */
+ __u16 :16; /* Zero padding */
__u32 local_ip4; /* Network byte order */
__u32 local_ip6[4]; /* Network byte order */
__u32 local_port; /* Host byte order */
diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index cb150f756f3d..f08034500813 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -1147,7 +1147,7 @@ int bpf_prog_test_run_sk_lookup(struct bpf_prog *prog, const union bpf_attr *kat
if (!range_is_zero(user_ctx, offsetofend(typeof(*user_ctx), local_port), sizeof(*user_ctx)))
goto out;
- if (user_ctx->local_port > U16_MAX || user_ctx->remote_port > U16_MAX) {
+ if (user_ctx->local_port > U16_MAX) {
ret = -ERANGE;
goto out;
}
@@ -1155,7 +1155,7 @@ int bpf_prog_test_run_sk_lookup(struct bpf_prog *prog, const union bpf_attr *kat
ctx.family = (u16)user_ctx->family;
ctx.protocol = (u16)user_ctx->protocol;
ctx.dport = (u16)user_ctx->local_port;
- ctx.sport = (__force __be16)user_ctx->remote_port;
+ ctx.sport = user_ctx->remote_port;
switch (ctx.family) {
case AF_INET:
diff --git a/net/core/filter.c b/net/core/filter.c
index 99a05199a806..83f06d3b2c52 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -10843,7 +10843,8 @@ static bool sk_lookup_is_valid_access(int off, int size,
case bpf_ctx_range(struct bpf_sk_lookup, local_ip4):
case bpf_ctx_range_till(struct bpf_sk_lookup, remote_ip6[0], remote_ip6[3]):
case bpf_ctx_range_till(struct bpf_sk_lookup, local_ip6[0], local_ip6[3]):
- case bpf_ctx_range(struct bpf_sk_lookup, remote_port):
+ case offsetof(struct bpf_sk_lookup, remote_port) ...
+ offsetof(struct bpf_sk_lookup, local_ip4) - 1:
case bpf_ctx_range(struct bpf_sk_lookup, local_port):
case bpf_ctx_range(struct bpf_sk_lookup, ingress_ifindex):
bpf_ctx_record_field_size(info, sizeof(__u32));
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 9a69e2b385f443f244a7e8b8bcafe5ccfb0866b4 Mon Sep 17 00:00:00 2001
From: Jakub Sitnicki <jakub(a)cloudflare.com>
Date: Wed, 9 Feb 2022 19:43:32 +0100
Subject: [PATCH] bpf: Make remote_port field in struct bpf_sk_lookup 16-bit
wide
remote_port is another case of a BPF context field documented as a 32-bit
value in network byte order for which the BPF context access converter
generates a load of a zero-padded 16-bit integer in network byte order.
First such case was dst_port in bpf_sock which got addressed in commit
4421a582718a ("bpf: Make dst_port field in struct bpf_sock 16-bit wide").
Loading 4-bytes from the remote_port offset and converting the value with
bpf_ntohl() leads to surprising results, as the expected value is shifted
by 16 bits.
Reduce the confusion by splitting the field in two - a 16-bit field holding
a big-endian integer, and a 16-bit zero-padding anonymous field that
follows it.
Suggested-by: Alexei Starovoitov <ast(a)kernel.org>
Signed-off-by: Jakub Sitnicki <jakub(a)cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast(a)kernel.org>
Link: https://lore.kernel.org/bpf/20220209184333.654927-2-jakub@cloudflare.com
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index a7f0ddedac1f..afe3d0d7f5f2 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -6453,7 +6453,8 @@ struct bpf_sk_lookup {
__u32 protocol; /* IP protocol (IPPROTO_TCP, IPPROTO_UDP) */
__u32 remote_ip4; /* Network byte order */
__u32 remote_ip6[4]; /* Network byte order */
- __u32 remote_port; /* Network byte order */
+ __be16 remote_port; /* Network byte order */
+ __u16 :16; /* Zero padding */
__u32 local_ip4; /* Network byte order */
__u32 local_ip6[4]; /* Network byte order */
__u32 local_port; /* Host byte order */
diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index cb150f756f3d..f08034500813 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -1147,7 +1147,7 @@ int bpf_prog_test_run_sk_lookup(struct bpf_prog *prog, const union bpf_attr *kat
if (!range_is_zero(user_ctx, offsetofend(typeof(*user_ctx), local_port), sizeof(*user_ctx)))
goto out;
- if (user_ctx->local_port > U16_MAX || user_ctx->remote_port > U16_MAX) {
+ if (user_ctx->local_port > U16_MAX) {
ret = -ERANGE;
goto out;
}
@@ -1155,7 +1155,7 @@ int bpf_prog_test_run_sk_lookup(struct bpf_prog *prog, const union bpf_attr *kat
ctx.family = (u16)user_ctx->family;
ctx.protocol = (u16)user_ctx->protocol;
ctx.dport = (u16)user_ctx->local_port;
- ctx.sport = (__force __be16)user_ctx->remote_port;
+ ctx.sport = user_ctx->remote_port;
switch (ctx.family) {
case AF_INET:
diff --git a/net/core/filter.c b/net/core/filter.c
index 99a05199a806..83f06d3b2c52 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -10843,7 +10843,8 @@ static bool sk_lookup_is_valid_access(int off, int size,
case bpf_ctx_range(struct bpf_sk_lookup, local_ip4):
case bpf_ctx_range_till(struct bpf_sk_lookup, remote_ip6[0], remote_ip6[3]):
case bpf_ctx_range_till(struct bpf_sk_lookup, local_ip6[0], local_ip6[3]):
- case bpf_ctx_range(struct bpf_sk_lookup, remote_port):
+ case offsetof(struct bpf_sk_lookup, remote_port) ...
+ offsetof(struct bpf_sk_lookup, local_ip4) - 1:
case bpf_ctx_range(struct bpf_sk_lookup, local_port):
case bpf_ctx_range(struct bpf_sk_lookup, ingress_ifindex):
bpf_ctx_record_field_size(info, sizeof(__u32));
This bug is marked as fixed by commit:
net: core: netlink: add helper refcount dec and lock function
net: sched: add helper function to take reference to Qdisc
net: sched: extend Qdisc with rcu
net: sched: rename qdisc_destroy() to qdisc_put()
net: sched: use Qdisc rcu API instead of relying on rtnl lock
But I can't find it in any tested tree for more than 90 days.
Is it a correct commit? Please update it by replying:
#syz fix: exact-commit-title
Until then the bug is still considered open and
new crashes with the same signature are ignored.
From: Johannes Berg <johannes.berg(a)intel.com>
We need this to be at least two bytes, so we can access
alpha2[0] and alpha2[1]. It may be three in case some
userspace used NUL-termination since it was NLA_STRING
(and we also push it out with NUL-termination).
Cc: stable(a)vger.kernel.org
Reported-by: Lee Jones <lee.jones(a)linaro.org>
Signed-off-by: Johannes Berg <johannes.berg(a)intel.com>
---
net/wireless/nl80211.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/net/wireless/nl80211.c b/net/wireless/nl80211.c
index ee1c2b6b6971..21e808fcb676 100644
--- a/net/wireless/nl80211.c
+++ b/net/wireless/nl80211.c
@@ -528,7 +528,8 @@ static const struct nla_policy nl80211_policy[NUM_NL80211_ATTR] = {
.len = IEEE80211_MAX_MESH_ID_LEN },
[NL80211_ATTR_MPATH_NEXT_HOP] = NLA_POLICY_ETH_ADDR_COMPAT,
- [NL80211_ATTR_REG_ALPHA2] = { .type = NLA_STRING, .len = 2 },
+ /* allow 3 for NUL-termination, we used to declare this NLA_STRING */
+ [NL80211_ATTR_REG_ALPHA2] = NLA_POLICY_RANGE(NLA_BINARY, 2, 3),
[NL80211_ATTR_REG_RULES] = { .type = NLA_NESTED },
[NL80211_ATTR_BSS_CTS_PROT] = { .type = NLA_U8 },
--
2.35.1
commit 4a204f7895878363ca8211f50ec610408c8c70aa upstream.
Expand KVM's mask for the AVIC host physical ID to the full 12 bits defined
by the architecture. The number of bits consumed by hardware is model
specific, e.g. early CPUs ignored bits 11:8, but there is no way for KVM
to enumerate the "true" size. So, KVM must allow using all bits, else it
risks rejecting completely legal x2APIC IDs on newer CPUs.
This means KVM relies on hardware to not assign x2APIC IDs that exceed the
"true" width of the field, but presumably hardware is smart enough to tie
the width to the max x2APIC ID. KVM also relies on hardware to support at
least 8 bits, as the legacy xAPIC ID is writable by software. But, those
assumptions are unavoidable due to the lack of any way to enumerate the
"true" width.
Note:
This patch has been modified from the upstream commit due to the conflict
caused by the commit 391503528257 ("KVM: x86: SVM: move avic definitions
from AMD's spec to svm.h")
Please apply this patch to the following kernel trees:
* 4.14-stable
* 4.9-stable
* 4.19-stable
* 5.4-stable
* 5.10-stable
* 5.15-stable
* 5.16-stable
Cc: stable(a)vger.kernel.org
Cc: Maxim Levitsky <mlevitsk(a)redhat.com>
Suggested-by: Sean Christopherson <seanjc(a)google.com>
Reviewed-by: Sean Christopherson <seanjc(a)google.com>
Fixes: 44a95dae1d22 ("KVM: x86: Detect and Initialize AVIC support")
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit(a)amd.com>
Message-Id: <20220211000851.185799-1-suravee.suthikulpanit(a)amd.com>
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
---
arch/x86/kvm/svm/avic.c | 7 +------
arch/x86/kvm/svm/svm.h | 2 +-
2 files changed, 2 insertions(+), 7 deletions(-)
diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 212af871ca74..be0bc06797e6 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -943,15 +943,10 @@ avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu, bool r)
void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
{
u64 entry;
- /* ID = 0xff (broadcast), ID > 0xff (reserved) */
int h_physical_id = kvm_cpu_get_apicid(cpu);
struct vcpu_svm *svm = to_svm(vcpu);
- /*
- * Since the host physical APIC id is 8 bits,
- * we can support host APIC ID upto 255.
- */
- if (WARN_ON(h_physical_id > AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK))
+ if (WARN_ON(h_physical_id & ~AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK))
return;
entry = READ_ONCE(*(svm->avic_physical_id_cache));
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 1b460e539926..47e4fd75caa9 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -508,7 +508,7 @@ extern struct kvm_x86_nested_ops svm_nested_ops;
#define AVIC_LOGICAL_ID_ENTRY_VALID_BIT 31
#define AVIC_LOGICAL_ID_ENTRY_VALID_MASK (1 << 31)
-#define AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK (0xFFULL)
+#define AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK GENMASK_ULL(11, 0)
#define AVIC_PHYSICAL_ID_ENTRY_BACKING_PAGE_MASK (0xFFFFFFFFFFULL << 12)
#define AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK (1ULL << 62)
#define AVIC_PHYSICAL_ID_ENTRY_VALID_MASK (1ULL << 63)
--
2.17.1
On Fri, Apr 08, 2022 at 07:53:42PM -0400, Joshua Freedman wrote:
> I got the first bisect done, but not sure what do bisect after that.....
What is the result, good or bad? What does 'git bisect log' show?
thanks,
greg k-h
Commit 87ebbb8c612b broke suspend-to-RAM on kernel 5.17; trying to resume
with it results in an unusable system. I've tried reverting it from 5.17
and found that resuming works once again.
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 3be232f11a3cc9b0ef0795e39fa11bdb8e422a06 Mon Sep 17 00:00:00 2001
From: Trond Myklebust <trond.myklebust(a)hammerspace.com>
Date: Tue, 26 Oct 2021 18:01:07 -0400
Subject: [PATCH] SUNRPC: Prevent immediate close+reconnect
If we have already set up the socket and are waiting for it to connect,
then don't immediately close and retry.
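For illustration only (not from the patch), a userspace sketch of the
"schedule only once" idiom the hunk below adopts, using C11 atomics in place
of test_and_set_bit():
===8<===
#include <stdio.h>
#include <stdatomic.h>

static atomic_int close_wait;

static void schedule_autoclose(void)
{
	/* Analog of: if (test_and_set_bit(XPRT_CLOSE_WAIT, ...)) return; */
	if (atomic_exchange(&close_wait, 1))
		return;		/* already pending, don't queue again */
	printf("queueing cleanup work\n");
}

int main(void)
{
	schedule_autoclose();	/* queues the work */
	schedule_autoclose();	/* no-op: already pending */
	return 0;
}
===8<===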
Signed-off-by: Trond Myklebust <trond.myklebust(a)hammerspace.com>
diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index 691fe5a682b6..a02de2bddb28 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -767,7 +767,8 @@ EXPORT_SYMBOL_GPL(xprt_disconnect_done);
*/
static void xprt_schedule_autoclose_locked(struct rpc_xprt *xprt)
{
- set_bit(XPRT_CLOSE_WAIT, &xprt->state);
+ if (test_and_set_bit(XPRT_CLOSE_WAIT, &xprt->state))
+ return;
if (test_and_set_bit(XPRT_LOCKED, &xprt->state) == 0)
queue_work(xprtiod_workqueue, &xprt->task_cleanup);
else if (xprt->snd_task && !test_bit(XPRT_SND_IS_COOKIE, &xprt->state))
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 7fb302e202bc..ae48c9c84ee1 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -2314,7 +2314,7 @@ static void xs_connect(struct rpc_xprt *xprt, struct rpc_task *task)
WARN_ON_ONCE(!xprt_lock_connect(xprt, task, transport));
- if (transport->sock != NULL) {
+ if (transport->sock != NULL && !xprt_connecting(xprt)) {
dprintk("RPC: xs_connect delayed xprt %p for %lu "
"seconds\n",
xprt, xprt->reestablish_timeout / HZ);
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 89f42494f92f448747bd8a7ab1ae8b5d5520577d Mon Sep 17 00:00:00 2001
From: Trond Myklebust <trond.myklebust(a)hammerspace.com>
Date: Wed, 16 Mar 2022 19:10:43 -0400
Subject: [PATCH] SUNRPC: Don't call connect() more than once on a TCP socket
Avoid socket state races due to repeated calls to ->connect() using the
same socket. If connect() returns 0 due to the connection having
completed, but we are in fact in a closing state, then we may leave the
XPRT_CONNECTING flag set on the transport.
Reported-by: Enrico Scholz <enrico.scholz(a)sigma-chemnitz.de>
Fixes: 3be232f11a3c ("SUNRPC: Prevent immediate close+reconnect")
Signed-off-by: Trond Myklebust <trond.myklebust(a)hammerspace.com>
diff --git a/include/linux/sunrpc/xprtsock.h b/include/linux/sunrpc/xprtsock.h
index 8c2a712cb242..689062afdd61 100644
--- a/include/linux/sunrpc/xprtsock.h
+++ b/include/linux/sunrpc/xprtsock.h
@@ -89,5 +89,6 @@ struct sock_xprt {
#define XPRT_SOCK_WAKE_WRITE (5)
#define XPRT_SOCK_WAKE_PENDING (6)
#define XPRT_SOCK_WAKE_DISCONNECT (7)
+#define XPRT_SOCK_CONNECT_SENT (8)
#endif /* _LINUX_SUNRPC_XPRTSOCK_H */
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 7e39f87cde2d..8f8a03c3315a 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -2235,10 +2235,15 @@ static void xs_tcp_setup_socket(struct work_struct *work)
if (atomic_read(&xprt->swapper))
current->flags |= PF_MEMALLOC;
- if (!sock) {
- sock = xs_create_sock(xprt, transport,
- xs_addr(xprt)->sa_family, SOCK_STREAM,
- IPPROTO_TCP, true);
+
+ if (xprt_connected(xprt))
+ goto out;
+ if (test_and_clear_bit(XPRT_SOCK_CONNECT_SENT,
+ &transport->sock_state) ||
+ !sock) {
+ xs_reset_transport(transport);
+ sock = xs_create_sock(xprt, transport, xs_addr(xprt)->sa_family,
+ SOCK_STREAM, IPPROTO_TCP, true);
if (IS_ERR(sock)) {
xprt_wake_pending_tasks(xprt, PTR_ERR(sock));
goto out;
@@ -2262,6 +2267,7 @@ static void xs_tcp_setup_socket(struct work_struct *work)
fallthrough;
case -EINPROGRESS:
/* SYN_SENT! */
+ set_bit(XPRT_SOCK_CONNECT_SENT, &transport->sock_state);
if (xprt->reestablish_timeout < XS_TCP_INIT_REEST_TO)
xprt->reestablish_timeout = XS_TCP_INIT_REEST_TO;
fallthrough;
@@ -2323,13 +2329,9 @@ static void xs_connect(struct rpc_xprt *xprt, struct rpc_task *task)
WARN_ON_ONCE(!xprt_lock_connect(xprt, task, transport));
- if (transport->sock != NULL && !xprt_connecting(xprt)) {
+ if (transport->sock != NULL) {
dprintk("RPC: xs_connect delayed xprt %p for %lu "
- "seconds\n",
- xprt, xprt->reestablish_timeout / HZ);
-
- /* Start by resetting any existing state */
- xs_reset_transport(transport);
+ "seconds\n", xprt, xprt->reestablish_timeout / HZ);
delay = xprt_reconnect_delay(xprt);
xprt_reconnect_backoff(xprt, XS_TCP_INIT_REEST_TO);
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From f00432063db1a0db484e85193eccc6845435b80e Mon Sep 17 00:00:00 2001
From: Trond Myklebust <trond.myklebust(a)hammerspace.com>
Date: Sun, 3 Apr 2022 15:58:11 -0400
Subject: [PATCH] SUNRPC: Ensure we flush any closed sockets before
xs_xprt_free()
We must ensure that all sockets are closed before we call xprt_free()
and release the reference to the net namespace. The problem is that
calling fput() will defer closing the socket until delayed_fput() gets
called.
Let's fix the situation by allowing rpciod and the transport teardown
code (which runs on the system wq) to call __fput_sync(), and directly
close the socket.
Reported-by: Felix Fu <foyjog(a)gmail.com>
Acked-by: Al Viro <viro(a)zeniv.linux.org.uk>
Fixes: a73881c96d73 ("SUNRPC: Fix an Oops in udp_poll()")
Cc: stable(a)vger.kernel.org # 5.1.x: 3be232f11a3c: SUNRPC: Prevent immediate close+reconnect
Cc: stable(a)vger.kernel.org # 5.1.x: 89f42494f92f: SUNRPC: Don't call connect() more than once on a TCP socket
Cc: stable(a)vger.kernel.org # 5.1.x
Signed-off-by: Trond Myklebust <trond.myklebust(a)hammerspace.com>
diff --git a/fs/file_table.c b/fs/file_table.c
index 7d2e692b66a9..ada8fe814db9 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -412,6 +412,7 @@ void __fput_sync(struct file *file)
}
EXPORT_SYMBOL(fput);
+EXPORT_SYMBOL(__fput_sync);
void __init files_init(void)
{
diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
index ac33892da411..a4848c7bab80 100644
--- a/include/trace/events/sunrpc.h
+++ b/include/trace/events/sunrpc.h
@@ -1004,7 +1004,6 @@ DEFINE_RPC_XPRT_LIFETIME_EVENT(connect);
DEFINE_RPC_XPRT_LIFETIME_EVENT(disconnect_auto);
DEFINE_RPC_XPRT_LIFETIME_EVENT(disconnect_done);
DEFINE_RPC_XPRT_LIFETIME_EVENT(disconnect_force);
-DEFINE_RPC_XPRT_LIFETIME_EVENT(disconnect_cleanup);
DEFINE_RPC_XPRT_LIFETIME_EVENT(destroy);
DECLARE_EVENT_CLASS(rpc_xprt_event,
diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index 73344ffb2692..ad62eba540a4 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -930,12 +930,7 @@ void xprt_connect(struct rpc_task *task)
if (!xprt_lock_write(xprt, task))
return;
- if (test_and_clear_bit(XPRT_CLOSE_WAIT, &xprt->state)) {
- trace_xprt_disconnect_cleanup(xprt);
- xprt->ops->close(xprt);
- }
-
- if (!xprt_connected(xprt)) {
+ if (!xprt_connected(xprt) && !test_bit(XPRT_CLOSE_WAIT, &xprt->state)) {
task->tk_rqstp->rq_connect_cookie = xprt->connect_cookie;
rpc_sleep_on_timeout(&xprt->pending, task, NULL,
xprt_request_timeout(task->tk_rqstp));
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 9b75891b3cc0..c6a13893e308 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -879,7 +879,7 @@ static int xs_local_send_request(struct rpc_rqst *req)
/* Close the stream if the previous transmission was incomplete */
if (xs_send_request_was_aborted(transport, req)) {
- xs_close(xprt);
+ xprt_force_disconnect(xprt);
return -ENOTCONN;
}
@@ -915,7 +915,7 @@ static int xs_local_send_request(struct rpc_rqst *req)
-status);
fallthrough;
case -EPIPE:
- xs_close(xprt);
+ xprt_force_disconnect(xprt);
status = -ENOTCONN;
}
@@ -1185,6 +1185,16 @@ static void xs_reset_transport(struct sock_xprt *transport)
if (sk == NULL)
return;
+ /*
+ * Make sure we're calling this in a context from which it is safe
+ * to call __fput_sync(). In practice that means rpciod and the
+ * system workqueue.
+ */
+ if (!(current->flags & PF_WQ_WORKER)) {
+ WARN_ON_ONCE(1);
+ set_bit(XPRT_CLOSE_WAIT, &xprt->state);
+ return;
+ }
if (atomic_read(&transport->xprt.swapper))
sk_clear_memalloc(sk);
@@ -1208,7 +1218,7 @@ static void xs_reset_transport(struct sock_xprt *transport)
mutex_unlock(&transport->recv_mutex);
trace_rpc_socket_close(xprt, sock);
- fput(filp);
+ __fput_sync(filp);
xprt_disconnect_done(xprt);
}
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From f00432063db1a0db484e85193eccc6845435b80e Mon Sep 17 00:00:00 2001
From: Trond Myklebust <trond.myklebust(a)hammerspace.com>
Date: Sun, 3 Apr 2022 15:58:11 -0400
Subject: [PATCH] SUNRPC: Ensure we flush any closed sockets before
xs_xprt_free()
We must ensure that all sockets are closed before we call xprt_free()
and release the reference to the net namespace. The problem is that
calling fput() will defer closing the socket until delayed_fput() gets
called.
Let's fix the situation by allowing rpciod and the transport teardown
code (which runs on the system wq) to call __fput_sync(), and directly
close the socket.
Reported-by: Felix Fu <foyjog(a)gmail.com>
Acked-by: Al Viro <viro(a)zeniv.linux.org.uk>
Fixes: a73881c96d73 ("SUNRPC: Fix an Oops in udp_poll()")
Cc: stable(a)vger.kernel.org # 5.1.x: 3be232f11a3c: SUNRPC: Prevent immediate close+reconnect
Cc: stable(a)vger.kernel.org # 5.1.x: 89f42494f92f: SUNRPC: Don't call connect() more than once on a TCP socket
Cc: stable(a)vger.kernel.org # 5.1.x
Signed-off-by: Trond Myklebust <trond.myklebust(a)hammerspace.com>
diff --git a/fs/file_table.c b/fs/file_table.c
index 7d2e692b66a9..ada8fe814db9 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -412,6 +412,7 @@ void __fput_sync(struct file *file)
}
EXPORT_SYMBOL(fput);
+EXPORT_SYMBOL(__fput_sync);
void __init files_init(void)
{
diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
index ac33892da411..a4848c7bab80 100644
--- a/include/trace/events/sunrpc.h
+++ b/include/trace/events/sunrpc.h
@@ -1004,7 +1004,6 @@ DEFINE_RPC_XPRT_LIFETIME_EVENT(connect);
DEFINE_RPC_XPRT_LIFETIME_EVENT(disconnect_auto);
DEFINE_RPC_XPRT_LIFETIME_EVENT(disconnect_done);
DEFINE_RPC_XPRT_LIFETIME_EVENT(disconnect_force);
-DEFINE_RPC_XPRT_LIFETIME_EVENT(disconnect_cleanup);
DEFINE_RPC_XPRT_LIFETIME_EVENT(destroy);
DECLARE_EVENT_CLASS(rpc_xprt_event,
diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index 73344ffb2692..ad62eba540a4 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -930,12 +930,7 @@ void xprt_connect(struct rpc_task *task)
if (!xprt_lock_write(xprt, task))
return;
- if (test_and_clear_bit(XPRT_CLOSE_WAIT, &xprt->state)) {
- trace_xprt_disconnect_cleanup(xprt);
- xprt->ops->close(xprt);
- }
-
- if (!xprt_connected(xprt)) {
+ if (!xprt_connected(xprt) && !test_bit(XPRT_CLOSE_WAIT, &xprt->state)) {
task->tk_rqstp->rq_connect_cookie = xprt->connect_cookie;
rpc_sleep_on_timeout(&xprt->pending, task, NULL,
xprt_request_timeout(task->tk_rqstp));
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 9b75891b3cc0..c6a13893e308 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -879,7 +879,7 @@ static int xs_local_send_request(struct rpc_rqst *req)
/* Close the stream if the previous transmission was incomplete */
if (xs_send_request_was_aborted(transport, req)) {
- xs_close(xprt);
+ xprt_force_disconnect(xprt);
return -ENOTCONN;
}
@@ -915,7 +915,7 @@ static int xs_local_send_request(struct rpc_rqst *req)
-status);
fallthrough;
case -EPIPE:
- xs_close(xprt);
+ xprt_force_disconnect(xprt);
status = -ENOTCONN;
}
@@ -1185,6 +1185,16 @@ static void xs_reset_transport(struct sock_xprt *transport)
if (sk == NULL)
return;
+ /*
+ * Make sure we're calling this in a context from which it is safe
+ * to call __fput_sync(). In practice that means rpciod and the
+ * system workqueue.
+ */
+ if (!(current->flags & PF_WQ_WORKER)) {
+ WARN_ON_ONCE(1);
+ set_bit(XPRT_CLOSE_WAIT, &xprt->state);
+ return;
+ }
if (atomic_read(&transport->xprt.swapper))
sk_clear_memalloc(sk);
@@ -1208,7 +1218,7 @@ static void xs_reset_transport(struct sock_xprt *transport)
mutex_unlock(&transport->recv_mutex);
trace_rpc_socket_close(xprt, sock);
- fput(filp);
+ __fput_sync(filp);
xprt_disconnect_done(xprt);
}
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 89f42494f92f448747bd8a7ab1ae8b5d5520577d Mon Sep 17 00:00:00 2001
From: Trond Myklebust <trond.myklebust(a)hammerspace.com>
Date: Wed, 16 Mar 2022 19:10:43 -0400
Subject: [PATCH] SUNRPC: Don't call connect() more than once on a TCP socket
Avoid socket state races due to repeated calls to ->connect() using the
same socket. If connect() returns 0 due to the connection having
completed, but we are in fact in a closing state, then we may leave the
XPRT_CONNECTING flag set on the transport.
Reported-by: Enrico Scholz <enrico.scholz(a)sigma-chemnitz.de>
Fixes: 3be232f11a3c ("SUNRPC: Prevent immediate close+reconnect")
Signed-off-by: Trond Myklebust <trond.myklebust(a)hammerspace.com>
diff --git a/include/linux/sunrpc/xprtsock.h b/include/linux/sunrpc/xprtsock.h
index 8c2a712cb242..689062afdd61 100644
--- a/include/linux/sunrpc/xprtsock.h
+++ b/include/linux/sunrpc/xprtsock.h
@@ -89,5 +89,6 @@ struct sock_xprt {
#define XPRT_SOCK_WAKE_WRITE (5)
#define XPRT_SOCK_WAKE_PENDING (6)
#define XPRT_SOCK_WAKE_DISCONNECT (7)
+#define XPRT_SOCK_CONNECT_SENT (8)
#endif /* _LINUX_SUNRPC_XPRTSOCK_H */
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 7e39f87cde2d..8f8a03c3315a 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -2235,10 +2235,15 @@ static void xs_tcp_setup_socket(struct work_struct *work)
if (atomic_read(&xprt->swapper))
current->flags |= PF_MEMALLOC;
- if (!sock) {
- sock = xs_create_sock(xprt, transport,
- xs_addr(xprt)->sa_family, SOCK_STREAM,
- IPPROTO_TCP, true);
+
+ if (xprt_connected(xprt))
+ goto out;
+ if (test_and_clear_bit(XPRT_SOCK_CONNECT_SENT,
+ &transport->sock_state) ||
+ !sock) {
+ xs_reset_transport(transport);
+ sock = xs_create_sock(xprt, transport, xs_addr(xprt)->sa_family,
+ SOCK_STREAM, IPPROTO_TCP, true);
if (IS_ERR(sock)) {
xprt_wake_pending_tasks(xprt, PTR_ERR(sock));
goto out;
@@ -2262,6 +2267,7 @@ static void xs_tcp_setup_socket(struct work_struct *work)
fallthrough;
case -EINPROGRESS:
/* SYN_SENT! */
+ set_bit(XPRT_SOCK_CONNECT_SENT, &transport->sock_state);
if (xprt->reestablish_timeout < XS_TCP_INIT_REEST_TO)
xprt->reestablish_timeout = XS_TCP_INIT_REEST_TO;
fallthrough;
@@ -2323,13 +2329,9 @@ static void xs_connect(struct rpc_xprt *xprt, struct rpc_task *task)
WARN_ON_ONCE(!xprt_lock_connect(xprt, task, transport));
- if (transport->sock != NULL && !xprt_connecting(xprt)) {
+ if (transport->sock != NULL) {
dprintk("RPC: xs_connect delayed xprt %p for %lu "
- "seconds\n",
- xprt, xprt->reestablish_timeout / HZ);
-
- /* Start by resetting any existing state */
- xs_reset_transport(transport);
+ "seconds\n", xprt, xprt->reestablish_timeout / HZ);
delay = xprt_reconnect_delay(xprt);
xprt_reconnect_backoff(xprt, XS_TCP_INIT_REEST_TO);
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 89f42494f92f448747bd8a7ab1ae8b5d5520577d Mon Sep 17 00:00:00 2001
From: Trond Myklebust <trond.myklebust(a)hammerspace.com>
Date: Wed, 16 Mar 2022 19:10:43 -0400
Subject: [PATCH] SUNRPC: Don't call connect() more than once on a TCP socket
Avoid socket state races due to repeated calls to ->connect() using the
same socket. If connect() returns 0 due to the connection having
completed, but we are in fact in a closing state, then we may leave the
XPRT_CONNECTING flag set on the transport.
Reported-by: Enrico Scholz <enrico.scholz(a)sigma-chemnitz.de>
Fixes: 3be232f11a3c ("SUNRPC: Prevent immediate close+reconnect")
Signed-off-by: Trond Myklebust <trond.myklebust(a)hammerspace.com>
diff --git a/include/linux/sunrpc/xprtsock.h b/include/linux/sunrpc/xprtsock.h
index 8c2a712cb242..689062afdd61 100644
--- a/include/linux/sunrpc/xprtsock.h
+++ b/include/linux/sunrpc/xprtsock.h
@@ -89,5 +89,6 @@ struct sock_xprt {
#define XPRT_SOCK_WAKE_WRITE (5)
#define XPRT_SOCK_WAKE_PENDING (6)
#define XPRT_SOCK_WAKE_DISCONNECT (7)
+#define XPRT_SOCK_CONNECT_SENT (8)
#endif /* _LINUX_SUNRPC_XPRTSOCK_H */
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 7e39f87cde2d..8f8a03c3315a 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -2235,10 +2235,15 @@ static void xs_tcp_setup_socket(struct work_struct *work)
if (atomic_read(&xprt->swapper))
current->flags |= PF_MEMALLOC;
- if (!sock) {
- sock = xs_create_sock(xprt, transport,
- xs_addr(xprt)->sa_family, SOCK_STREAM,
- IPPROTO_TCP, true);
+
+ if (xprt_connected(xprt))
+ goto out;
+ if (test_and_clear_bit(XPRT_SOCK_CONNECT_SENT,
+ &transport->sock_state) ||
+ !sock) {
+ xs_reset_transport(transport);
+ sock = xs_create_sock(xprt, transport, xs_addr(xprt)->sa_family,
+ SOCK_STREAM, IPPROTO_TCP, true);
if (IS_ERR(sock)) {
xprt_wake_pending_tasks(xprt, PTR_ERR(sock));
goto out;
@@ -2262,6 +2267,7 @@ static void xs_tcp_setup_socket(struct work_struct *work)
fallthrough;
case -EINPROGRESS:
/* SYN_SENT! */
+ set_bit(XPRT_SOCK_CONNECT_SENT, &transport->sock_state);
if (xprt->reestablish_timeout < XS_TCP_INIT_REEST_TO)
xprt->reestablish_timeout = XS_TCP_INIT_REEST_TO;
fallthrough;
@@ -2323,13 +2329,9 @@ static void xs_connect(struct rpc_xprt *xprt, struct rpc_task *task)
WARN_ON_ONCE(!xprt_lock_connect(xprt, task, transport));
- if (transport->sock != NULL && !xprt_connecting(xprt)) {
+ if (transport->sock != NULL) {
dprintk("RPC: xs_connect delayed xprt %p for %lu "
- "seconds\n",
- xprt, xprt->reestablish_timeout / HZ);
-
- /* Start by resetting any existing state */
- xs_reset_transport(transport);
+ "seconds\n", xprt, xprt->reestablish_timeout / HZ);
delay = xprt_reconnect_delay(xprt);
xprt_reconnect_backoff(xprt, XS_TCP_INIT_REEST_TO);
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 3be232f11a3cc9b0ef0795e39fa11bdb8e422a06 Mon Sep 17 00:00:00 2001
From: Trond Myklebust <trond.myklebust(a)hammerspace.com>
Date: Tue, 26 Oct 2021 18:01:07 -0400
Subject: [PATCH] SUNRPC: Prevent immediate close+reconnect
If we have already set up the socket and are waiting for it to connect,
then don't immediately close and retry.
Signed-off-by: Trond Myklebust <trond.myklebust(a)hammerspace.com>
diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index 691fe5a682b6..a02de2bddb28 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -767,7 +767,8 @@ EXPORT_SYMBOL_GPL(xprt_disconnect_done);
*/
static void xprt_schedule_autoclose_locked(struct rpc_xprt *xprt)
{
- set_bit(XPRT_CLOSE_WAIT, &xprt->state);
+ if (test_and_set_bit(XPRT_CLOSE_WAIT, &xprt->state))
+ return;
if (test_and_set_bit(XPRT_LOCKED, &xprt->state) == 0)
queue_work(xprtiod_workqueue, &xprt->task_cleanup);
else if (xprt->snd_task && !test_bit(XPRT_SND_IS_COOKIE, &xprt->state))
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 7fb302e202bc..ae48c9c84ee1 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -2314,7 +2314,7 @@ static void xs_connect(struct rpc_xprt *xprt, struct rpc_task *task)
WARN_ON_ONCE(!xprt_lock_connect(xprt, task, transport));
- if (transport->sock != NULL) {
+ if (transport->sock != NULL && !xprt_connecting(xprt)) {
dprintk("RPC: xs_connect delayed xprt %p for %lu "
"seconds\n",
xprt, xprt->reestablish_timeout / HZ);
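A minimal sketch of the idempotence that the xprt_schedule_autoclose_locked() hunk relies on, written here with C11 atomics in user space instead of the kernel's test_and_set_bit() (the names are illustrative only): only the caller that actually flips the bit queues the cleanup work, every later caller returns immediately.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_bool close_wait;   /* plays the role of XPRT_CLOSE_WAIT */

static void schedule_autoclose(void)
{
	/* atomic_exchange returns the previous value, like test_and_set_bit() */
	if (atomic_exchange(&close_wait, true))
		return;                          /* already scheduled, do nothing */
	printf("queueing cleanup work exactly once\n");
}

int main(void)
{
	schedule_autoclose();
	schedule_autoclose();   /* second call sees the bit set and bails out */
	return 0;
}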
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 3be232f11a3cc9b0ef0795e39fa11bdb8e422a06 Mon Sep 17 00:00:00 2001
From: Trond Myklebust <trond.myklebust(a)hammerspace.com>
Date: Tue, 26 Oct 2021 18:01:07 -0400
Subject: [PATCH] SUNRPC: Prevent immediate close+reconnect
If we have already set up the socket and are waiting for it to connect,
then don't immediately close and retry.
Signed-off-by: Trond Myklebust <trond.myklebust(a)hammerspace.com>
diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index 691fe5a682b6..a02de2bddb28 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -767,7 +767,8 @@ EXPORT_SYMBOL_GPL(xprt_disconnect_done);
*/
static void xprt_schedule_autoclose_locked(struct rpc_xprt *xprt)
{
- set_bit(XPRT_CLOSE_WAIT, &xprt->state);
+ if (test_and_set_bit(XPRT_CLOSE_WAIT, &xprt->state))
+ return;
if (test_and_set_bit(XPRT_LOCKED, &xprt->state) == 0)
queue_work(xprtiod_workqueue, &xprt->task_cleanup);
else if (xprt->snd_task && !test_bit(XPRT_SND_IS_COOKIE, &xprt->state))
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 7fb302e202bc..ae48c9c84ee1 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -2314,7 +2314,7 @@ static void xs_connect(struct rpc_xprt *xprt, struct rpc_task *task)
WARN_ON_ONCE(!xprt_lock_connect(xprt, task, transport));
- if (transport->sock != NULL) {
+ if (transport->sock != NULL && !xprt_connecting(xprt)) {
dprintk("RPC: xs_connect delayed xprt %p for %lu "
"seconds\n",
xprt, xprt->reestablish_timeout / HZ);
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From dda81d9761d07541c404dd5fa93e773a8eda5ddc Mon Sep 17 00:00:00 2001
From: tiancyin <tianci.yin(a)amd.com>
Date: Sun, 27 Mar 2022 19:07:13 +0800
Subject: [PATCH] drm/amd/vcn: fix an error msg on vcn 3.0
Some video cards have more than one VCN instance; passing 0 to
vcn_v3_0_pause_dpg_mode is incorrect.
Error msg:
Register(1) [mmUVD_POWER_STATUS] failed to reach value
0x00000001 != 0x00000002
Reviewed-by: James Zhu <James.Zhu(a)amd.com>
Signed-off-by: tiancyin <tianci.yin(a)amd.com>
Signed-off-by: Alex Deucher <alexander.deucher(a)amd.com>
Cc: stable(a)vger.kernel.org
diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c b/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
index e1cca0a10653..cb5f0a12333f 100644
--- a/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
@@ -1488,7 +1488,7 @@ static int vcn_v3_0_stop_dpg_mode(struct amdgpu_device *adev, int inst_idx)
struct dpg_pause_state state = {.fw_based = VCN_DPG_STATE__UNPAUSE};
uint32_t tmp;
- vcn_v3_0_pause_dpg_mode(adev, 0, &state);
+ vcn_v3_0_pause_dpg_mode(adev, inst_idx, &state);
/* Wait for power status to be 1 */
SOC15_WAIT_ON_RREG(VCN, inst_idx, mmUVD_POWER_STATUS, 1,
The patch below does not apply to the 5.16-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From dda81d9761d07541c404dd5fa93e773a8eda5ddc Mon Sep 17 00:00:00 2001
From: tiancyin <tianci.yin(a)amd.com>
Date: Sun, 27 Mar 2022 19:07:13 +0800
Subject: [PATCH] drm/amd/vcn: fix an error msg on vcn 3.0
Some video cards have more than one VCN instance; passing 0 to
vcn_v3_0_pause_dpg_mode is incorrect.
Error msg:
Register(1) [mmUVD_POWER_STATUS] failed to reach value
0x00000001 != 0x00000002
Reviewed-by: James Zhu <James.Zhu(a)amd.com>
Signed-off-by: tiancyin <tianci.yin(a)amd.com>
Signed-off-by: Alex Deucher <alexander.deucher(a)amd.com>
Cc: stable(a)vger.kernel.org
diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c b/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
index e1cca0a10653..cb5f0a12333f 100644
--- a/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
@@ -1488,7 +1488,7 @@ static int vcn_v3_0_stop_dpg_mode(struct amdgpu_device *adev, int inst_idx)
struct dpg_pause_state state = {.fw_based = VCN_DPG_STATE__UNPAUSE};
uint32_t tmp;
- vcn_v3_0_pause_dpg_mode(adev, 0, &state);
+ vcn_v3_0_pause_dpg_mode(adev, inst_idx, &state);
/* Wait for power status to be 1 */
SOC15_WAIT_ON_RREG(VCN, inst_idx, mmUVD_POWER_STATUS, 1,
The patch below does not apply to the 5.17-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From dda81d9761d07541c404dd5fa93e773a8eda5ddc Mon Sep 17 00:00:00 2001
From: tiancyin <tianci.yin(a)amd.com>
Date: Sun, 27 Mar 2022 19:07:13 +0800
Subject: [PATCH] drm/amd/vcn: fix an error msg on vcn 3.0
Some video cards have more than one VCN instance; passing 0 to
vcn_v3_0_pause_dpg_mode is incorrect.
Error msg:
Register(1) [mmUVD_POWER_STATUS] failed to reach value
0x00000001 != 0x00000002
Reviewed-by: James Zhu <James.Zhu(a)amd.com>
Signed-off-by: tiancyin <tianci.yin(a)amd.com>
Signed-off-by: Alex Deucher <alexander.deucher(a)amd.com>
Cc: stable(a)vger.kernel.org
diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c b/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
index e1cca0a10653..cb5f0a12333f 100644
--- a/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
@@ -1488,7 +1488,7 @@ static int vcn_v3_0_stop_dpg_mode(struct amdgpu_device *adev, int inst_idx)
struct dpg_pause_state state = {.fw_based = VCN_DPG_STATE__UNPAUSE};
uint32_t tmp;
- vcn_v3_0_pause_dpg_mode(adev, 0, &state);
+ vcn_v3_0_pause_dpg_mode(adev, inst_idx, &state);
/* Wait for power status to be 1 */
SOC15_WAIT_ON_RREG(VCN, inst_idx, mmUVD_POWER_STATUS, 1,
Commit 76821e222c18 ("serial: imx: ensure that RX irqs are off if RX is
off") accidentally enabled overrun interrupts unconditionally when
deferring DMA enable until after the receiver has been enabled during
startup.
Fix this by using the DMA-initialised instead of DMA-enabled flag to
determine whether overrun interrupts should be enabled.
Note that overrun interrupts are already accounted for in
imx_uart_clear_rx_errors() when using DMA since commit 41d98b5da92f
("serial: imx-serial - update RX error counters when DMA is used").
Fixes: 76821e222c18 ("serial: imx: ensure that RX irqs are off if RX is off")
Cc: stable(a)vger.kernel.org # 4.17
Cc: Uwe Kleine-König <u.kleine-koenig(a)pengutronix.de>
Signed-off-by: Johan Hovold <johan(a)kernel.org>
---
drivers/tty/serial/imx.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/tty/serial/imx.c b/drivers/tty/serial/imx.c
index fd38e6ed4fda..a2100be8d554 100644
--- a/drivers/tty/serial/imx.c
+++ b/drivers/tty/serial/imx.c
@@ -1448,7 +1448,7 @@ static int imx_uart_startup(struct uart_port *port)
imx_uart_writel(sport, ucr1, UCR1);
ucr4 = imx_uart_readl(sport, UCR4) & ~(UCR4_OREN | UCR4_INVR);
- if (!sport->dma_is_enabled)
+ if (!dma_is_inited)
ucr4 |= UCR4_OREN;
if (sport->inverted_rx)
ucr4 |= UCR4_INVR;
--
2.35.1
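A small user-space sketch of why testing the wrong flag matters here; the bit value and variable names below are invented for illustration and are not the driver's definitions. At startup DMA has been initialised but not yet enabled, so gating the overrun interrupt on the enabled flag switches it on even though DMA will take over overrun accounting.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define UCR4_OREN (1u << 1)   /* made-up bit value, for illustration only */

int main(void)
{
	bool dma_is_inited  = true;    /* DMA will be used for RX */
	bool dma_is_enabled = false;   /* ...but it only gets enabled later */
	uint32_t ucr4_old = 0, ucr4_new = 0;

	if (!dma_is_enabled)           /* old, buggy condition */
		ucr4_old |= UCR4_OREN;
	if (!dma_is_inited)            /* fixed condition */
		ucr4_new |= UCR4_OREN;

	printf("old UCR4=0x%x (OREN wrongly set), new UCR4=0x%x\n",
	       ucr4_old, ucr4_new);
	return 0;
}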
The patch below does not apply to the 5.17-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From af704c856e888fb044b058d731d61b46eeec499d Mon Sep 17 00:00:00 2001
From: "Jason A. Donenfeld" <Jason(a)zx2c4.com>
Date: Mon, 21 Mar 2022 18:48:05 -0600
Subject: [PATCH] random: skip fast_init if hwrng provides large chunk of
entropy
At boot time, EFI calls add_bootloader_randomness(), which in turn calls
add_hwgenerator_randomness(). Currently add_hwgenerator_randomness()
feeds the first 64 bytes of randomness to the "fast init"
non-crypto-grade phase. But if add_hwgenerator_randomness() gets called
with more than POOL_MIN_BITS of entropy, there's no point in passing it
off to the "fast init" stage, since that's enough entropy to bootstrap
the real RNG. The "fast init" stage is just there to provide _something_
in the case where we don't have enough entropy to properly bootstrap the
RNG. But if we do have enough entropy to bootstrap the RNG, the current
logic doesn't serve a purpose. So, in the case where we're passed
greater than or equal to POOL_MIN_BITS of entropy, this commit makes us
skip the "fast init" phase.
Cc: Dominik Brodowski <linux(a)dominikbrodowski.net>
Signed-off-by: Jason A. Donenfeld <Jason(a)zx2c4.com>
diff --git a/drivers/char/random.c b/drivers/char/random.c
index 66ce7c03a142..74e0b069972e 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1128,7 +1128,7 @@ void rand_initialize_disk(struct gendisk *disk)
void add_hwgenerator_randomness(const void *buffer, size_t count,
size_t entropy)
{
- if (unlikely(crng_init == 0)) {
+ if (unlikely(crng_init == 0 && entropy < POOL_MIN_BITS)) {
size_t ret = crng_pre_init_inject(buffer, count, true);
mix_pool_bytes(buffer, ret);
count -= ret;
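A rough user-space sketch of the decision this hunk adds, with an arbitrary stand-in value for POOL_MIN_BITS (the real constant lives in drivers/char/random.c): contributions that already carry enough credited entropy to seed the real RNG bypass the fast-init stage.

#include <stddef.h>
#include <stdio.h>

#define POOL_MIN_BITS 256   /* stand-in value, for illustration only */

static void hwrng_contribute(size_t entropy_bits, int crng_init)
{
	if (crng_init == 0 && entropy_bits < POOL_MIN_BITS)
		printf("%zu bits: feed the fast init stage\n", entropy_bits);
	else
		printf("%zu bits: go straight to the input pool\n", entropy_bits);
}

int main(void)
{
	hwrng_contribute(64, 0);    /* small chunk before init: fast init */
	hwrng_contribute(512, 0);   /* large chunk: enough to bootstrap, skip fast init */
	return 0;
}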
The patch below does not apply to the 5.17-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 527a9867af29ff89f278d037db704e0ed50fb666 Mon Sep 17 00:00:00 2001
From: Jan Varho <jan.varho(a)gmail.com>
Date: Mon, 4 Apr 2022 19:42:30 +0300
Subject: [PATCH] random: do not split fast init input in
add_hwgenerator_randomness()
add_hwgenerator_randomness() tries to only use the required amount of input
for fast init, but credits all the entropy, rather than a fraction of
it. Since it's hard to determine how much entropy is left over out of a
non-uniformly random sample, either give it all to fast init or credit
it, but don't attempt to do both. In the process, we can clean up the
injection code to no longer need to return a value.
Signed-off-by: Jan Varho <jan.varho(a)gmail.com>
[Jason: expanded commit message]
Fixes: 73c7733f122e ("random: do not throw away excess input to crng_fast_load")
Cc: stable(a)vger.kernel.org # 5.17+, requires af704c856e88
Signed-off-by: Jason A. Donenfeld <Jason(a)zx2c4.com>
diff --git a/drivers/char/random.c b/drivers/char/random.c
index 1d8242969751..ee3ad2ba0942 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -437,11 +437,8 @@ static void crng_make_state(u32 chacha_state[CHACHA_STATE_WORDS],
* This shouldn't be set by functions like add_device_randomness(),
* where we can't trust the buffer passed to it is guaranteed to be
* unpredictable (so it might not have any entropy at all).
- *
- * Returns the number of bytes processed from input, which is bounded
- * by CRNG_INIT_CNT_THRESH if account is true.
*/
-static size_t crng_pre_init_inject(const void *input, size_t len, bool account)
+static void crng_pre_init_inject(const void *input, size_t len, bool account)
{
static int crng_init_cnt = 0;
struct blake2s_state hash;
@@ -452,18 +449,15 @@ static size_t crng_pre_init_inject(const void *input, size_t len, bool account)
spin_lock_irqsave(&base_crng.lock, flags);
if (crng_init != 0) {
spin_unlock_irqrestore(&base_crng.lock, flags);
- return 0;
+ return;
}
- if (account)
- len = min_t(size_t, len, CRNG_INIT_CNT_THRESH - crng_init_cnt);
-
blake2s_update(&hash, base_crng.key, sizeof(base_crng.key));
blake2s_update(&hash, input, len);
blake2s_final(&hash, base_crng.key);
if (account) {
- crng_init_cnt += len;
+ crng_init_cnt += min_t(size_t, len, CRNG_INIT_CNT_THRESH - crng_init_cnt);
if (crng_init_cnt >= CRNG_INIT_CNT_THRESH) {
++base_crng.generation;
crng_init = 1;
@@ -474,8 +468,6 @@ static size_t crng_pre_init_inject(const void *input, size_t len, bool account)
if (crng_init == 1)
pr_notice("fast init done\n");
-
- return len;
}
static void _get_random_bytes(void *buf, size_t nbytes)
@@ -1141,12 +1133,9 @@ void add_hwgenerator_randomness(const void *buffer, size_t count,
size_t entropy)
{
if (unlikely(crng_init == 0 && entropy < POOL_MIN_BITS)) {
- size_t ret = crng_pre_init_inject(buffer, count, true);
- mix_pool_bytes(buffer, ret);
- count -= ret;
- buffer += ret;
- if (!count || crng_init == 0)
- return;
+ crng_pre_init_inject(buffer, count, true);
+ mix_pool_bytes(buffer, count);
+ return;
}
/*
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From ae4d37b5df749926891583d42a6801b5da11e3c1 Mon Sep 17 00:00:00 2001
From: Xiaomeng Tong <xiam0nd.tong(a)gmail.com>
Date: Wed, 6 Apr 2022 21:04:44 +0200
Subject: [PATCH] drbd: fix an invalid memory access caused by incorrect use of
list iterator
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
The bug is here:
idr_remove(&connection->peer_devices, vnr);
If the previous for_each_connection() doesn't exit early (no goto hit
inside the loop), the iterator 'connection' after the loop will be a
bogus pointer to an invalid structure object containing the HEAD
(&resource->connections). As a result, the use of 'connection' above
will lead to an invalid memory access (including a possible invalid free,
as idr_remove could call free_layer).
The original intention should have been to remove all peer_devices,
but the following lines already do that work. So just remove
this line and the now-unneeded label to fix this bug.
Cc: stable(a)vger.kernel.org
Fixes: c06ece6ba6f1b ("drbd: Turn connection->volumes into connection->peer_devices")
Signed-off-by: Xiaomeng Tong <xiam0nd.tong(a)gmail.com>
Reviewed-by: Christoph Böhmwalder <christoph.boehmwalder(a)linbit.com>
Reviewed-by: Lars Ellenberg <lars.ellenberg(a)linbit.com>
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 9676a1d214bc..d6dfa286ddb3 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -2773,12 +2773,12 @@ enum drbd_ret_code drbd_create_device(struct drbd_config_context *adm_ctx, unsig
if (init_submitter(device)) {
err = ERR_NOMEM;
- goto out_idr_remove_vol;
+ goto out_idr_remove_from_resource;
}
err = add_disk(disk);
if (err)
- goto out_idr_remove_vol;
+ goto out_idr_remove_from_resource;
/* inherit the connection state */
device->state.conn = first_connection(resource)->cstate;
@@ -2792,8 +2792,6 @@ enum drbd_ret_code drbd_create_device(struct drbd_config_context *adm_ctx, unsig
drbd_debugfs_device_add(device);
return NO_ERROR;
-out_idr_remove_vol:
- idr_remove(&connection->peer_devices, vnr);
out_idr_remove_from_resource:
for_each_connection(connection, resource) {
peer_device = idr_remove(&connection->peer_devices, vnr);
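For illustration only, a self-contained user-space program showing the generic pitfall behind this bug, using a simplified stand-in for the kernel's list_for_each_entry() (the macro below takes an explicit type instead of typeof, and the struct names are invented):

#include <stddef.h>
#include <stdio.h>

struct list_head { struct list_head *next, *prev; };
struct item { int id; struct list_head node; };

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

#define list_for_each_entry(pos, head, type, member) \
	for (pos = container_of((head)->next, type, member); \
	     &pos->member != (head); \
	     pos = container_of(pos->member.next, type, member))

int main(void)
{
	struct list_head head;
	struct item a = { .id = 1 };
	struct item *it;

	/* one-element circular list: head <-> a.node */
	head.next = head.prev = &a.node;
	a.node.next = a.node.prev = &head;

	list_for_each_entry(it, &head, struct item, node) {
		/* loop body never breaks or jumps out */
	}

	/*
	 * The loop ended because &it->node == &head, so 'it' is now
	 * container_of(&head, struct item, node): a bogus pointer computed
	 * from the list HEAD itself.  Dereferencing it->id here is exactly
	 * the kind of invalid access the drbd fix removes.
	 */
	printf("bogus iterator %p, head %p\n", (void *)it, (void *)&head);
	return 0;
}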
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From ae4d37b5df749926891583d42a6801b5da11e3c1 Mon Sep 17 00:00:00 2001
From: Xiaomeng Tong <xiam0nd.tong(a)gmail.com>
Date: Wed, 6 Apr 2022 21:04:44 +0200
Subject: [PATCH] drbd: fix an invalid memory access caused by incorrect use of
list iterator
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
The bug is here:
idr_remove(&connection->peer_devices, vnr);
If the previous for_each_connection() doesn't exit early (no goto hit
inside the loop), the iterator 'connection' after the loop will be a
bogus pointer to an invalid structure object containing the HEAD
(&resource->connections). As a result, the use of 'connection' above
will lead to an invalid memory access (including a possible invalid free,
as idr_remove could call free_layer).
The original intention should have been to remove all peer_devices,
but the following lines already do that work. So just remove
this line and the now-unneeded label to fix this bug.
Cc: stable(a)vger.kernel.org
Fixes: c06ece6ba6f1b ("drbd: Turn connection->volumes into connection->peer_devices")
Signed-off-by: Xiaomeng Tong <xiam0nd.tong(a)gmail.com>
Reviewed-by: Christoph Böhmwalder <christoph.boehmwalder(a)linbit.com>
Reviewed-by: Lars Ellenberg <lars.ellenberg(a)linbit.com>
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 9676a1d214bc..d6dfa286ddb3 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -2773,12 +2773,12 @@ enum drbd_ret_code drbd_create_device(struct drbd_config_context *adm_ctx, unsig
if (init_submitter(device)) {
err = ERR_NOMEM;
- goto out_idr_remove_vol;
+ goto out_idr_remove_from_resource;
}
err = add_disk(disk);
if (err)
- goto out_idr_remove_vol;
+ goto out_idr_remove_from_resource;
/* inherit the connection state */
device->state.conn = first_connection(resource)->cstate;
@@ -2792,8 +2792,6 @@ enum drbd_ret_code drbd_create_device(struct drbd_config_context *adm_ctx, unsig
drbd_debugfs_device_add(device);
return NO_ERROR;
-out_idr_remove_vol:
- idr_remove(&connection->peer_devices, vnr);
out_idr_remove_from_resource:
for_each_connection(connection, resource) {
peer_device = idr_remove(&connection->peer_devices, vnr);
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From ae4d37b5df749926891583d42a6801b5da11e3c1 Mon Sep 17 00:00:00 2001
From: Xiaomeng Tong <xiam0nd.tong(a)gmail.com>
Date: Wed, 6 Apr 2022 21:04:44 +0200
Subject: [PATCH] drbd: fix an invalid memory access caused by incorrect use of
list iterator
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
The bug is here:
idr_remove(&connection->peer_devices, vnr);
If the previous for_each_connection() doesn't exit early (no goto hit
inside the loop), the iterator 'connection' after the loop will be a
bogus pointer to an invalid structure object containing the HEAD
(&resource->connections). As a result, the use of 'connection' above
will lead to an invalid memory access (including a possible invalid free,
as idr_remove could call free_layer).
The original intention should have been to remove all peer_devices,
but the following lines already do that work. So just remove
this line and the now-unneeded label to fix this bug.
Cc: stable(a)vger.kernel.org
Fixes: c06ece6ba6f1b ("drbd: Turn connection->volumes into connection->peer_devices")
Signed-off-by: Xiaomeng Tong <xiam0nd.tong(a)gmail.com>
Reviewed-by: Christoph Böhmwalder <christoph.boehmwalder(a)linbit.com>
Reviewed-by: Lars Ellenberg <lars.ellenberg(a)linbit.com>
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 9676a1d214bc..d6dfa286ddb3 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -2773,12 +2773,12 @@ enum drbd_ret_code drbd_create_device(struct drbd_config_context *adm_ctx, unsig
if (init_submitter(device)) {
err = ERR_NOMEM;
- goto out_idr_remove_vol;
+ goto out_idr_remove_from_resource;
}
err = add_disk(disk);
if (err)
- goto out_idr_remove_vol;
+ goto out_idr_remove_from_resource;
/* inherit the connection state */
device->state.conn = first_connection(resource)->cstate;
@@ -2792,8 +2792,6 @@ enum drbd_ret_code drbd_create_device(struct drbd_config_context *adm_ctx, unsig
drbd_debugfs_device_add(device);
return NO_ERROR;
-out_idr_remove_vol:
- idr_remove(&connection->peer_devices, vnr);
out_idr_remove_from_resource:
for_each_connection(connection, resource) {
peer_device = idr_remove(&connection->peer_devices, vnr);
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From ae4d37b5df749926891583d42a6801b5da11e3c1 Mon Sep 17 00:00:00 2001
From: Xiaomeng Tong <xiam0nd.tong(a)gmail.com>
Date: Wed, 6 Apr 2022 21:04:44 +0200
Subject: [PATCH] drbd: fix an invalid memory access caused by incorrect use of
list iterator
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
The bug is here:
idr_remove(&connection->peer_devices, vnr);
If the previous for_each_connection() doesn't exit early (no goto hit
inside the loop), the iterator 'connection' after the loop will be a
bogus pointer to an invalid structure object containing the HEAD
(&resource->connections). As a result, the use of 'connection' above
will lead to an invalid memory access (including a possible invalid free,
as idr_remove could call free_layer).
The original intention should have been to remove all peer_devices,
but the following lines already do that work. So just remove
this line and the now-unneeded label to fix this bug.
Cc: stable(a)vger.kernel.org
Fixes: c06ece6ba6f1b ("drbd: Turn connection->volumes into connection->peer_devices")
Signed-off-by: Xiaomeng Tong <xiam0nd.tong(a)gmail.com>
Reviewed-by: Christoph Böhmwalder <christoph.boehmwalder(a)linbit.com>
Reviewed-by: Lars Ellenberg <lars.ellenberg(a)linbit.com>
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 9676a1d214bc..d6dfa286ddb3 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -2773,12 +2773,12 @@ enum drbd_ret_code drbd_create_device(struct drbd_config_context *adm_ctx, unsig
if (init_submitter(device)) {
err = ERR_NOMEM;
- goto out_idr_remove_vol;
+ goto out_idr_remove_from_resource;
}
err = add_disk(disk);
if (err)
- goto out_idr_remove_vol;
+ goto out_idr_remove_from_resource;
/* inherit the connection state */
device->state.conn = first_connection(resource)->cstate;
@@ -2792,8 +2792,6 @@ enum drbd_ret_code drbd_create_device(struct drbd_config_context *adm_ctx, unsig
drbd_debugfs_device_add(device);
return NO_ERROR;
-out_idr_remove_vol:
- idr_remove(&connection->peer_devices, vnr);
out_idr_remove_from_resource:
for_each_connection(connection, resource) {
peer_device = idr_remove(&connection->peer_devices, vnr);
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From ae4d37b5df749926891583d42a6801b5da11e3c1 Mon Sep 17 00:00:00 2001
From: Xiaomeng Tong <xiam0nd.tong(a)gmail.com>
Date: Wed, 6 Apr 2022 21:04:44 +0200
Subject: [PATCH] drbd: fix an invalid memory access caused by incorrect use of
list iterator
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
The bug is here:
idr_remove(&connection->peer_devices, vnr);
If the previous for_each_connection() doesn't exit early (no goto hit
inside the loop), the iterator 'connection' after the loop will be a
bogus pointer to an invalid structure object containing the HEAD
(&resource->connections). As a result, the use of 'connection' above
will lead to an invalid memory access (including a possible invalid free,
as idr_remove could call free_layer).
The original intention should have been to remove all peer_devices,
but the following lines already do that work. So just remove
this line and the now-unneeded label to fix this bug.
Cc: stable(a)vger.kernel.org
Fixes: c06ece6ba6f1b ("drbd: Turn connection->volumes into connection->peer_devices")
Signed-off-by: Xiaomeng Tong <xiam0nd.tong(a)gmail.com>
Reviewed-by: Christoph Böhmwalder <christoph.boehmwalder(a)linbit.com>
Reviewed-by: Lars Ellenberg <lars.ellenberg(a)linbit.com>
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 9676a1d214bc..d6dfa286ddb3 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -2773,12 +2773,12 @@ enum drbd_ret_code drbd_create_device(struct drbd_config_context *adm_ctx, unsig
if (init_submitter(device)) {
err = ERR_NOMEM;
- goto out_idr_remove_vol;
+ goto out_idr_remove_from_resource;
}
err = add_disk(disk);
if (err)
- goto out_idr_remove_vol;
+ goto out_idr_remove_from_resource;
/* inherit the connection state */
device->state.conn = first_connection(resource)->cstate;
@@ -2792,8 +2792,6 @@ enum drbd_ret_code drbd_create_device(struct drbd_config_context *adm_ctx, unsig
drbd_debugfs_device_add(device);
return NO_ERROR;
-out_idr_remove_vol:
- idr_remove(&connection->peer_devices, vnr);
out_idr_remove_from_resource:
for_each_connection(connection, resource) {
peer_device = idr_remove(&connection->peer_devices, vnr);
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From ae4d37b5df749926891583d42a6801b5da11e3c1 Mon Sep 17 00:00:00 2001
From: Xiaomeng Tong <xiam0nd.tong(a)gmail.com>
Date: Wed, 6 Apr 2022 21:04:44 +0200
Subject: [PATCH] drbd: fix an invalid memory access caused by incorrect use of
list iterator
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
The bug is here:
idr_remove(&connection->peer_devices, vnr);
If the previous for_each_connection() doesn't exit early (no goto hit
inside the loop), the iterator 'connection' after the loop will be a
bogus pointer to an invalid structure object containing the HEAD
(&resource->connections). As a result, the use of 'connection' above
will lead to an invalid memory access (including a possible invalid free,
as idr_remove could call free_layer).
The original intention should have been to remove all peer_devices,
but the following lines already do that work. So just remove
this line and the now-unneeded label to fix this bug.
Cc: stable(a)vger.kernel.org
Fixes: c06ece6ba6f1b ("drbd: Turn connection->volumes into connection->peer_devices")
Signed-off-by: Xiaomeng Tong <xiam0nd.tong(a)gmail.com>
Reviewed-by: Christoph Böhmwalder <christoph.boehmwalder(a)linbit.com>
Reviewed-by: Lars Ellenberg <lars.ellenberg(a)linbit.com>
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 9676a1d214bc..d6dfa286ddb3 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -2773,12 +2773,12 @@ enum drbd_ret_code drbd_create_device(struct drbd_config_context *adm_ctx, unsig
if (init_submitter(device)) {
err = ERR_NOMEM;
- goto out_idr_remove_vol;
+ goto out_idr_remove_from_resource;
}
err = add_disk(disk);
if (err)
- goto out_idr_remove_vol;
+ goto out_idr_remove_from_resource;
/* inherit the connection state */
device->state.conn = first_connection(resource)->cstate;
@@ -2792,8 +2792,6 @@ enum drbd_ret_code drbd_create_device(struct drbd_config_context *adm_ctx, unsig
drbd_debugfs_device_add(device);
return NO_ERROR;
-out_idr_remove_vol:
- idr_remove(&connection->peer_devices, vnr);
out_idr_remove_from_resource:
for_each_connection(connection, resource) {
peer_device = idr_remove(&connection->peer_devices, vnr);
The following commit has been merged into the irq/urgent branch of tip:
Commit-ID: 08d835dff916bfe8f45acc7b92c7af6c4081c8a7
Gitweb: https://git.kernel.org/tip/08d835dff916bfe8f45acc7b92c7af6c4081c8a7
Author: Rei Yamamoto <yamamoto.rei(a)jp.fujitsu.com>
AuthorDate: Thu, 31 Mar 2022 09:33:09 +09:00
Committer: Thomas Gleixner <tglx(a)linutronix.de>
CommitterDate: Mon, 11 Apr 2022 09:58:03 +02:00
genirq/affinity: Consider that CPUs on nodes can be unbalanced
If CPUs on a node are offline at boot time, the number of nodes is
different when building affinity masks for present cpus and when building
affinity masks for possible cpus. This causes the following problem:
When the number of vectors is less than the number of nodes, bits of the
masks built for present CPUs can be overwritten while building the masks
for possible CPUs.
Fix this by excluding CPUs which are not part of the current build mask
(present/possible).
[ tglx: Massaged changelog and added comment ]
Fixes: b82592199032 ("genirq/affinity: Spread IRQs to all available NUMA nodes")
Signed-off-by: Rei Yamamoto <yamamoto.rei(a)jp.fujitsu.com>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Reviewed-by: Ming Lei <ming.lei(a)redhat.com>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/20220331003309.10891-1-yamamoto.rei@jp.fujitsu.com
---
kernel/irq/affinity.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index f7ff891..fdf1704 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -269,8 +269,9 @@ static int __irq_build_affinity_masks(unsigned int startvec,
*/
if (numvecs <= nodes) {
for_each_node_mask(n, nodemsk) {
- cpumask_or(&masks[curvec].mask, &masks[curvec].mask,
- node_to_cpumask[n]);
+ /* Ensure that only CPUs which are in both masks are set */
+ cpumask_and(nmsk, cpu_mask, node_to_cpumask[n]);
+ cpumask_or(&masks[curvec].mask, &masks[curvec].mask, nmsk);
if (++curvec == last_affv)
curvec = firstvec;
}
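A toy sketch of why the added cpumask_and() matters, using plain 64-bit masks instead of the kernel cpumask API; all values below are arbitrary:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t node_cpus  = 0x0f;   /* CPUs 0-3 belong to this node */
	uint64_t build_mask = 0x05;   /* only CPUs 0 and 2 are in the current build (e.g. present) */
	uint64_t vec_mask   = 0;

	/* old behaviour: OR in the whole node mask, also picking up CPUs 1 and 3 */
	uint64_t old_way = vec_mask | node_cpus;

	/* fixed behaviour: restrict to the CPUs that are in both masks first */
	uint64_t fixed = vec_mask | (node_cpus & build_mask);

	printf("old: 0x%llx  fixed: 0x%llx\n",
	       (unsigned long long)old_way, (unsigned long long)fixed);
	return 0;
}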
The patch below does not apply to the 5.17-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 62ed0bf7315b524973bb5fb9174b60e353289835 Mon Sep 17 00:00:00 2001
From: Johannes Thumshirn <johannes.thumshirn(a)wdc.com>
Date: Mon, 7 Mar 2022 02:47:18 -0800
Subject: [PATCH] btrfs: zoned: remove left over ASSERT checking for single
profile
With commit dcf5652291f6 ("btrfs: zoned: allow DUP on meta-data block
groups") we started allowing DUP on metadata block groups, so the
ASSERT()s in btrfs_can_activate_zone() and btrfs_zoned_get_device() are
no longer valid and in fact even harmful.
Fixes: dcf5652291f6 ("btrfs: zoned: allow DUP on meta-data block groups")
CC: stable(a)vger.kernel.org # 5.17
Signed-off-by: Johannes Thumshirn <johannes.thumshirn(a)wdc.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 61125aec8723..1b1b310c3c51 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -1801,7 +1801,6 @@ struct btrfs_device *btrfs_zoned_get_device(struct btrfs_fs_info *fs_info,
map = em->map_lookup;
/* We only support single profile for now */
- ASSERT(map->num_stripes == 1);
device = map->stripes[0].dev;
free_extent_map(em);
@@ -1983,9 +1982,6 @@ bool btrfs_can_activate_zone(struct btrfs_fs_devices *fs_devices, u64 flags)
if (!btrfs_is_zoned(fs_info))
return true;
- /* Non-single profiles are not supported yet */
- ASSERT((flags & BTRFS_BLOCK_GROUP_PROFILE_MASK) == 0);
-
/* Check if there is a device with active zones left */
mutex_lock(&fs_info->chunk_mutex);
list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list) {
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 60021bd754c6ca0addc6817994f20290a321d8d6 Mon Sep 17 00:00:00 2001
From: Kaiwen Hu <kevinhu(a)synology.com>
Date: Wed, 23 Mar 2022 15:10:32 +0800
Subject: [PATCH] btrfs: prevent subvol with swapfile from being deleted
A subvolume with an active swapfile must not be deleted otherwise it
would not be possible to deactivate it.
After the subvolume is deleted, we cannot swapoff the swapfile in this
deleted subvolume because the path is unreachable. The swapfile is
still active and holding references, so the filesystem cannot be unmounted.
The test looks like this:
mkfs.btrfs -f $dev > /dev/null
mount $dev $mnt
btrfs sub create $mnt/subvol
touch $mnt/subvol/swapfile
chmod 600 $mnt/subvol/swapfile
chattr +C $mnt/subvol/swapfile
dd if=/dev/zero of=$mnt/subvol/swapfile bs=1K count=4096
mkswap $mnt/subvol/swapfile
swapon $mnt/subvol/swapfile
btrfs sub delete $mnt/subvol
swapoff $mnt/subvol/swapfile # failed: No such file or directory
swapoff --all
umount $mnt # target is busy.
To prevent the above issue, we simply check whether the subvolume
contains any active swapfile and stop the deletion process. This
behavior matches how the snapshot ioctl deals with a swapfile.
CC: stable(a)vger.kernel.org # 5.4+
Reviewed-by: Robbie Ko <robbieko(a)synology.com>
Reviewed-by: Qu Wenruo <wqu(a)suse.com>
Reviewed-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: Kaiwen Hu <kevinhu(a)synology.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b976f757571f..5aab6af88349 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4487,6 +4487,13 @@ int btrfs_delete_subvolume(struct inode *dir, struct dentry *dentry)
dest->root_key.objectid);
return -EPERM;
}
+ if (atomic_read(&dest->nr_swapfiles)) {
+ spin_unlock(&dest->root_item_lock);
+ btrfs_warn(fs_info,
+ "attempt to delete subvolume %llu with active swapfile",
+ root->root_key.objectid);
+ return -EPERM;
+ }
root_flags = btrfs_root_flags(&dest->root_item);
btrfs_set_root_flags(&dest->root_item,
root_flags | BTRFS_ROOT_SUBVOL_DEAD);
@@ -11110,8 +11117,23 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
* set. We use this counter to prevent snapshots. We must increment it
* before walking the extents because we don't want a concurrent
* snapshot to run after we've already checked the extents.
+ *
+ * It is possible that subvolume is marked for deletion but still not
+ * removed yet. To prevent this race, we check the root status before
+ * activating the swapfile.
*/
+ spin_lock(&root->root_item_lock);
+ if (btrfs_root_dead(root)) {
+ spin_unlock(&root->root_item_lock);
+
+ btrfs_exclop_finish(fs_info);
+ btrfs_warn(fs_info,
+ "cannot activate swapfile because subvolume %llu is being deleted",
+ root->root_key.objectid);
+ return -EPERM;
+ }
atomic_inc(&root->nr_swapfiles);
+ spin_unlock(&root->root_item_lock);
isize = ALIGN_DOWN(inode->i_size, fs_info->sectorsize);
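A small user-space sketch of the mutual check being added, with invented names (fake_subvol) and a pthread mutex standing in for root_item_lock: deletion bails out if a swapfile is active, activation bails out if the root is already marked dead, and both checks run under the same lock so the two paths cannot race.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

struct fake_subvol {
	pthread_mutex_t lock;
	int nr_swapfiles;
	bool dead;
};

static int delete_subvol(struct fake_subvol *sv)
{
	pthread_mutex_lock(&sv->lock);
	if (sv->nr_swapfiles) {
		pthread_mutex_unlock(&sv->lock);
		return -1;               /* -EPERM in the real code */
	}
	sv->dead = true;                 /* mark the root dead under the lock */
	pthread_mutex_unlock(&sv->lock);
	return 0;
}

static int activate_swapfile(struct fake_subvol *sv)
{
	pthread_mutex_lock(&sv->lock);
	if (sv->dead) {
		pthread_mutex_unlock(&sv->lock);
		return -1;               /* -EPERM in the real code */
	}
	sv->nr_swapfiles++;              /* count the swapfile under the lock */
	pthread_mutex_unlock(&sv->lock);
	return 0;
}

int main(void)
{
	struct fake_subvol sv = { PTHREAD_MUTEX_INITIALIZER, 0, false };

	activate_swapfile(&sv);
	printf("delete with active swapfile: %d\n", delete_subvol(&sv));
	return 0;
}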
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From bbac58698a55cc0a6f0c0d69a6dcd3f9f3134c11 Mon Sep 17 00:00:00 2001
From: Qu Wenruo <wqu(a)suse.com>
Date: Tue, 8 Mar 2022 13:36:38 +0800
Subject: [PATCH] btrfs: remove device item and update super block in the same
transaction
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
[BUG]
There is a report of a btrfs filesystem with a bad super block num_devices value.
This makes btrfs reject the fs completely.
BTRFS error (device sdd3): super_num_devices 3 mismatch with num_devices 2 found here
BTRFS error (device sdd3): failed to read chunk tree: -22
BTRFS error (device sdd3): open_ctree failed
[CAUSE]
During btrfs device removal, chunk tree and super block num devs are
updated in two different transactions:
btrfs_rm_device()
|- btrfs_rm_dev_item(device)
| |- trans = btrfs_start_transaction()
| | Now we got transaction X
| |
| |- btrfs_del_item()
| | Now device item is removed from chunk tree
| |
| |- btrfs_commit_transaction()
| Transaction X got committed, super num devs untouched,
| but device item removed from chunk tree.
| (AKA, super num devs is already incorrect)
|
|- cur_devices->num_devices--;
|- cur_devices->total_devices--;
|- btrfs_set_super_num_devices()
All those operations are not in transaction X, thus they will
only be written back to disk in the next transaction.
So if a power loss happens after transaction X in btrfs_rm_dev_item() has
committed but before transaction X+1 (which can be minutes away), we end
up with the super block num_devices mismatch.
[FIX]
Instead of starting and committing a transaction inside
btrfs_rm_dev_item(), start a transaction inside btrfs_rm_device() and
pass it to btrfs_rm_dev_item().
And only commit the transaction after everything is done.
Reported-by: Luca Béla Palkovics <luca.bela.palkovics(a)gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CA+8xDSpvdm_U0QLBAnrH=zqDq_cWCOH5TiV46C…
CC: stable(a)vger.kernel.org # 4.14+
Reviewed-by: Anand Jain <anand.jain(a)oracle.com>
Signed-off-by: Qu Wenruo <wqu(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 1be7cb2f955f..2cfbc74a3b4e 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1896,23 +1896,18 @@ static void update_dev_time(const char *device_path)
path_put(&path);
}
-static int btrfs_rm_dev_item(struct btrfs_device *device)
+static int btrfs_rm_dev_item(struct btrfs_trans_handle *trans,
+ struct btrfs_device *device)
{
struct btrfs_root *root = device->fs_info->chunk_root;
int ret;
struct btrfs_path *path;
struct btrfs_key key;
- struct btrfs_trans_handle *trans;
path = btrfs_alloc_path();
if (!path)
return -ENOMEM;
- trans = btrfs_start_transaction(root, 0);
- if (IS_ERR(trans)) {
- btrfs_free_path(path);
- return PTR_ERR(trans);
- }
key.objectid = BTRFS_DEV_ITEMS_OBJECTID;
key.type = BTRFS_DEV_ITEM_KEY;
key.offset = device->devid;
@@ -1923,21 +1918,12 @@ static int btrfs_rm_dev_item(struct btrfs_device *device)
if (ret) {
if (ret > 0)
ret = -ENOENT;
- btrfs_abort_transaction(trans, ret);
- btrfs_end_transaction(trans);
goto out;
}
ret = btrfs_del_item(trans, root, path);
- if (ret) {
- btrfs_abort_transaction(trans, ret);
- btrfs_end_transaction(trans);
- }
-
out:
btrfs_free_path(path);
- if (!ret)
- ret = btrfs_commit_transaction(trans);
return ret;
}
@@ -2078,6 +2064,7 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info,
struct btrfs_dev_lookup_args *args,
struct block_device **bdev, fmode_t *mode)
{
+ struct btrfs_trans_handle *trans;
struct btrfs_device *device;
struct btrfs_fs_devices *cur_devices;
struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
@@ -2098,7 +2085,7 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info,
ret = btrfs_check_raid_min_devices(fs_info, num_devices - 1);
if (ret)
- goto out;
+ return ret;
device = btrfs_find_device(fs_info->fs_devices, args);
if (!device) {
@@ -2106,27 +2093,22 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info,
ret = BTRFS_ERROR_DEV_MISSING_NOT_FOUND;
else
ret = -ENOENT;
- goto out;
+ return ret;
}
if (btrfs_pinned_by_swapfile(fs_info, device)) {
btrfs_warn_in_rcu(fs_info,
"cannot remove device %s (devid %llu) due to active swapfile",
rcu_str_deref(device->name), device->devid);
- ret = -ETXTBSY;
- goto out;
+ return -ETXTBSY;
}
- if (test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state)) {
- ret = BTRFS_ERROR_DEV_TGT_REPLACE;
- goto out;
- }
+ if (test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state))
+ return BTRFS_ERROR_DEV_TGT_REPLACE;
if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state) &&
- fs_info->fs_devices->rw_devices == 1) {
- ret = BTRFS_ERROR_DEV_ONLY_WRITABLE;
- goto out;
- }
+ fs_info->fs_devices->rw_devices == 1)
+ return BTRFS_ERROR_DEV_ONLY_WRITABLE;
if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state)) {
mutex_lock(&fs_info->chunk_mutex);
@@ -2139,14 +2121,22 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info,
if (ret)
goto error_undo;
- /*
- * TODO: the superblock still includes this device in its num_devices
- * counter although write_all_supers() is not locked out. This
- * could give a filesystem state which requires a degraded mount.
- */
- ret = btrfs_rm_dev_item(device);
- if (ret)
+ trans = btrfs_start_transaction(fs_info->chunk_root, 0);
+ if (IS_ERR(trans)) {
+ ret = PTR_ERR(trans);
goto error_undo;
+ }
+
+ ret = btrfs_rm_dev_item(trans, device);
+ if (ret) {
+ /* Any error in dev item removal is critical */
+ btrfs_crit(fs_info,
+ "failed to remove device item for devid %llu: %d",
+ device->devid, ret);
+ btrfs_abort_transaction(trans, ret);
+ btrfs_end_transaction(trans);
+ return ret;
+ }
clear_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state);
btrfs_scrub_cancel_dev(device);
@@ -2229,7 +2219,8 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info,
free_fs_devices(cur_devices);
}
-out:
+ ret = btrfs_commit_transaction(trans);
+
return ret;
error_undo:
@@ -2240,7 +2231,7 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info,
device->fs_devices->rw_devices++;
mutex_unlock(&fs_info->chunk_mutex);
}
- goto out;
+ return ret;
}
void btrfs_rm_dev_replace_remove_srcdev(struct btrfs_device *srcdev)
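A toy sketch of the shape of the fix, with entirely made-up names rather than the btrfs transaction API: the caller opens one handle, every related update is staged against it, and there is a single commit, so a crash can never persist only half of the related updates.

#include <stdio.h>

struct fake_trans { int staged_updates; };

static int rm_dev_item(struct fake_trans *trans)
{
	trans->staged_updates++;         /* stage: delete the device item */
	return 0;
}

static int update_super_num_devices(struct fake_trans *trans)
{
	trans->staged_updates++;         /* stage: decrement num_devices */
	return 0;
}

static void commit(struct fake_trans *trans)
{
	printf("committing %d updates atomically\n", trans->staged_updates);
}

int main(void)
{
	struct fake_trans trans = { 0 };

	rm_dev_item(&trans);
	update_super_num_devices(&trans);
	commit(&trans);   /* a crash before this point loses both updates, never just one */
	return 0;
}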
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From bbac58698a55cc0a6f0c0d69a6dcd3f9f3134c11 Mon Sep 17 00:00:00 2001
From: Qu Wenruo <wqu(a)suse.com>
Date: Tue, 8 Mar 2022 13:36:38 +0800
Subject: [PATCH] btrfs: remove device item and update super block in the same
transaction
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
[BUG]
There is a report of a btrfs filesystem with a bad super block num_devices value.
This makes btrfs reject the fs completely.
BTRFS error (device sdd3): super_num_devices 3 mismatch with num_devices 2 found here
BTRFS error (device sdd3): failed to read chunk tree: -22
BTRFS error (device sdd3): open_ctree failed
[CAUSE]
During btrfs device removal, chunk tree and super block num devs are
updated in two different transactions:
btrfs_rm_device()
|- btrfs_rm_dev_item(device)
| |- trans = btrfs_start_transaction()
| | Now we got transaction X
| |
| |- btrfs_del_item()
| | Now device item is removed from chunk tree
| |
| |- btrfs_commit_transaction()
| Transaction X got committed, super num devs untouched,
| but device item removed from chunk tree.
| (AKA, super num devs is already incorrect)
|
|- cur_devices->num_devices--;
|- cur_devices->total_devices--;
|- btrfs_set_super_num_devices()
All those operations are not in transaction X, thus they will
only be written back to disk in the next transaction.
So if a power loss happens after transaction X in btrfs_rm_dev_item() has
committed but before transaction X+1 (which can be minutes away), we end
up with the super block num_devices mismatch.
[FIX]
Instead of starting and committing a transaction inside
btrfs_rm_dev_item(), start a transaction inside btrfs_rm_device() and
pass it to btrfs_rm_dev_item().
And only commit the transaction after everything is done.
Reported-by: Luca Béla Palkovics <luca.bela.palkovics(a)gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CA+8xDSpvdm_U0QLBAnrH=zqDq_cWCOH5TiV46C…
CC: stable(a)vger.kernel.org # 4.14+
Reviewed-by: Anand Jain <anand.jain(a)oracle.com>
Signed-off-by: Qu Wenruo <wqu(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 1be7cb2f955f..2cfbc74a3b4e 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1896,23 +1896,18 @@ static void update_dev_time(const char *device_path)
path_put(&path);
}
-static int btrfs_rm_dev_item(struct btrfs_device *device)
+static int btrfs_rm_dev_item(struct btrfs_trans_handle *trans,
+ struct btrfs_device *device)
{
struct btrfs_root *root = device->fs_info->chunk_root;
int ret;
struct btrfs_path *path;
struct btrfs_key key;
- struct btrfs_trans_handle *trans;
path = btrfs_alloc_path();
if (!path)
return -ENOMEM;
- trans = btrfs_start_transaction(root, 0);
- if (IS_ERR(trans)) {
- btrfs_free_path(path);
- return PTR_ERR(trans);
- }
key.objectid = BTRFS_DEV_ITEMS_OBJECTID;
key.type = BTRFS_DEV_ITEM_KEY;
key.offset = device->devid;
@@ -1923,21 +1918,12 @@ static int btrfs_rm_dev_item(struct btrfs_device *device)
if (ret) {
if (ret > 0)
ret = -ENOENT;
- btrfs_abort_transaction(trans, ret);
- btrfs_end_transaction(trans);
goto out;
}
ret = btrfs_del_item(trans, root, path);
- if (ret) {
- btrfs_abort_transaction(trans, ret);
- btrfs_end_transaction(trans);
- }
-
out:
btrfs_free_path(path);
- if (!ret)
- ret = btrfs_commit_transaction(trans);
return ret;
}
@@ -2078,6 +2064,7 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info,
struct btrfs_dev_lookup_args *args,
struct block_device **bdev, fmode_t *mode)
{
+ struct btrfs_trans_handle *trans;
struct btrfs_device *device;
struct btrfs_fs_devices *cur_devices;
struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
@@ -2098,7 +2085,7 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info,
ret = btrfs_check_raid_min_devices(fs_info, num_devices - 1);
if (ret)
- goto out;
+ return ret;
device = btrfs_find_device(fs_info->fs_devices, args);
if (!device) {
@@ -2106,27 +2093,22 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info,
ret = BTRFS_ERROR_DEV_MISSING_NOT_FOUND;
else
ret = -ENOENT;
- goto out;
+ return ret;
}
if (btrfs_pinned_by_swapfile(fs_info, device)) {
btrfs_warn_in_rcu(fs_info,
"cannot remove device %s (devid %llu) due to active swapfile",
rcu_str_deref(device->name), device->devid);
- ret = -ETXTBSY;
- goto out;
+ return -ETXTBSY;
}
- if (test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state)) {
- ret = BTRFS_ERROR_DEV_TGT_REPLACE;
- goto out;
- }
+ if (test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state))
+ return BTRFS_ERROR_DEV_TGT_REPLACE;
if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state) &&
- fs_info->fs_devices->rw_devices == 1) {
- ret = BTRFS_ERROR_DEV_ONLY_WRITABLE;
- goto out;
- }
+ fs_info->fs_devices->rw_devices == 1)
+ return BTRFS_ERROR_DEV_ONLY_WRITABLE;
if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state)) {
mutex_lock(&fs_info->chunk_mutex);
@@ -2139,14 +2121,22 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info,
if (ret)
goto error_undo;
- /*
- * TODO: the superblock still includes this device in its num_devices
- * counter although write_all_supers() is not locked out. This
- * could give a filesystem state which requires a degraded mount.
- */
- ret = btrfs_rm_dev_item(device);
- if (ret)
+ trans = btrfs_start_transaction(fs_info->chunk_root, 0);
+ if (IS_ERR(trans)) {
+ ret = PTR_ERR(trans);
goto error_undo;
+ }
+
+ ret = btrfs_rm_dev_item(trans, device);
+ if (ret) {
+ /* Any error in dev item removal is critical */
+ btrfs_crit(fs_info,
+ "failed to remove device item for devid %llu: %d",
+ device->devid, ret);
+ btrfs_abort_transaction(trans, ret);
+ btrfs_end_transaction(trans);
+ return ret;
+ }
clear_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state);
btrfs_scrub_cancel_dev(device);
@@ -2229,7 +2219,8 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info,
free_fs_devices(cur_devices);
}
-out:
+ ret = btrfs_commit_transaction(trans);
+
return ret;
error_undo:
@@ -2240,7 +2231,7 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info,
device->fs_devices->rw_devices++;
mutex_unlock(&fs_info->chunk_mutex);
}
- goto out;
+ return ret;
}
void btrfs_rm_dev_replace_remove_srcdev(struct btrfs_device *srcdev)
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From bbac58698a55cc0a6f0c0d69a6dcd3f9f3134c11 Mon Sep 17 00:00:00 2001
From: Qu Wenruo <wqu(a)suse.com>
Date: Tue, 8 Mar 2022 13:36:38 +0800
Subject: [PATCH] btrfs: remove device item and update super block in the same
transaction
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
[BUG]
There is a report of a btrfs filesystem with a bad super block num_devices value.
This makes btrfs reject the fs completely.
BTRFS error (device sdd3): super_num_devices 3 mismatch with num_devices 2 found here
BTRFS error (device sdd3): failed to read chunk tree: -22
BTRFS error (device sdd3): open_ctree failed
[CAUSE]
During btrfs device removal, chunk tree and super block num devs are
updated in two different transactions:
btrfs_rm_device()
|- btrfs_rm_dev_item(device)
| |- trans = btrfs_start_transaction()
| | Now we got transaction X
| |
| |- btrfs_del_item()
| | Now device item is removed from chunk tree
| |
| |- btrfs_commit_transaction()
| Transaction X got committed, super num devs untouched,
| but device item removed from chunk tree.
| (AKA, super num devs is already incorrect)
|
|- cur_devices->num_devices--;
|- cur_devices->total_devices--;
|- btrfs_set_super_num_devices()
All those operations are not in transaction X, thus they will
only be written back to disk in the next transaction.
So if a power loss happens after transaction X in btrfs_rm_dev_item() has
committed but before transaction X+1 (which can be minutes away), we end
up with the super block num_devices mismatch.
[FIX]
Instead of starting and committing a transaction inside
btrfs_rm_dev_item(), start a transaction inside btrfs_rm_device() and
pass it to btrfs_rm_dev_item().
And only commit the transaction after everything is done.
Reported-by: Luca Béla Palkovics <luca.bela.palkovics(a)gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CA+8xDSpvdm_U0QLBAnrH=zqDq_cWCOH5TiV46C…
CC: stable(a)vger.kernel.org # 4.14+
Reviewed-by: Anand Jain <anand.jain(a)oracle.com>
Signed-off-by: Qu Wenruo <wqu(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 1be7cb2f955f..2cfbc74a3b4e 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1896,23 +1896,18 @@ static void update_dev_time(const char *device_path)
path_put(&path);
}
-static int btrfs_rm_dev_item(struct btrfs_device *device)
+static int btrfs_rm_dev_item(struct btrfs_trans_handle *trans,
+ struct btrfs_device *device)
{
struct btrfs_root *root = device->fs_info->chunk_root;
int ret;
struct btrfs_path *path;
struct btrfs_key key;
- struct btrfs_trans_handle *trans;
path = btrfs_alloc_path();
if (!path)
return -ENOMEM;
- trans = btrfs_start_transaction(root, 0);
- if (IS_ERR(trans)) {
- btrfs_free_path(path);
- return PTR_ERR(trans);
- }
key.objectid = BTRFS_DEV_ITEMS_OBJECTID;
key.type = BTRFS_DEV_ITEM_KEY;
key.offset = device->devid;
@@ -1923,21 +1918,12 @@ static int btrfs_rm_dev_item(struct btrfs_device *device)
if (ret) {
if (ret > 0)
ret = -ENOENT;
- btrfs_abort_transaction(trans, ret);
- btrfs_end_transaction(trans);
goto out;
}
ret = btrfs_del_item(trans, root, path);
- if (ret) {
- btrfs_abort_transaction(trans, ret);
- btrfs_end_transaction(trans);
- }
-
out:
btrfs_free_path(path);
- if (!ret)
- ret = btrfs_commit_transaction(trans);
return ret;
}
@@ -2078,6 +2064,7 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info,
struct btrfs_dev_lookup_args *args,
struct block_device **bdev, fmode_t *mode)
{
+ struct btrfs_trans_handle *trans;
struct btrfs_device *device;
struct btrfs_fs_devices *cur_devices;
struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
@@ -2098,7 +2085,7 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info,
ret = btrfs_check_raid_min_devices(fs_info, num_devices - 1);
if (ret)
- goto out;
+ return ret;
device = btrfs_find_device(fs_info->fs_devices, args);
if (!device) {
@@ -2106,27 +2093,22 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info,
ret = BTRFS_ERROR_DEV_MISSING_NOT_FOUND;
else
ret = -ENOENT;
- goto out;
+ return ret;
}
if (btrfs_pinned_by_swapfile(fs_info, device)) {
btrfs_warn_in_rcu(fs_info,
"cannot remove device %s (devid %llu) due to active swapfile",
rcu_str_deref(device->name), device->devid);
- ret = -ETXTBSY;
- goto out;
+ return -ETXTBSY;
}
- if (test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state)) {
- ret = BTRFS_ERROR_DEV_TGT_REPLACE;
- goto out;
- }
+ if (test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state))
+ return BTRFS_ERROR_DEV_TGT_REPLACE;
if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state) &&
- fs_info->fs_devices->rw_devices == 1) {
- ret = BTRFS_ERROR_DEV_ONLY_WRITABLE;
- goto out;
- }
+ fs_info->fs_devices->rw_devices == 1)
+ return BTRFS_ERROR_DEV_ONLY_WRITABLE;
if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state)) {
mutex_lock(&fs_info->chunk_mutex);
@@ -2139,14 +2121,22 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info,
if (ret)
goto error_undo;
- /*
- * TODO: the superblock still includes this device in its num_devices
- * counter although write_all_supers() is not locked out. This
- * could give a filesystem state which requires a degraded mount.
- */
- ret = btrfs_rm_dev_item(device);
- if (ret)
+ trans = btrfs_start_transaction(fs_info->chunk_root, 0);
+ if (IS_ERR(trans)) {
+ ret = PTR_ERR(trans);
goto error_undo;
+ }
+
+ ret = btrfs_rm_dev_item(trans, device);
+ if (ret) {
+ /* Any error in dev item removal is critical */
+ btrfs_crit(fs_info,
+ "failed to remove device item for devid %llu: %d",
+ device->devid, ret);
+ btrfs_abort_transaction(trans, ret);
+ btrfs_end_transaction(trans);
+ return ret;
+ }
clear_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state);
btrfs_scrub_cancel_dev(device);
@@ -2229,7 +2219,8 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info,
free_fs_devices(cur_devices);
}
-out:
+ ret = btrfs_commit_transaction(trans);
+
return ret;
error_undo:
@@ -2240,7 +2231,7 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info,
device->fs_devices->rw_devices++;
mutex_unlock(&fs_info->chunk_mutex);
}
- goto out;
+ return ret;
}
void btrfs_rm_dev_replace_remove_srcdev(struct btrfs_device *srcdev)
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From bbac58698a55cc0a6f0c0d69a6dcd3f9f3134c11 Mon Sep 17 00:00:00 2001
From: Qu Wenruo <wqu(a)suse.com>
Date: Tue, 8 Mar 2022 13:36:38 +0800
Subject: [PATCH] btrfs: remove device item and update super block in the same
transaction
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
[BUG]
There is a report that a btrfs filesystem has a bad super block num
devices value. This makes btrfs reject the fs completely.
BTRFS error (device sdd3): super_num_devices 3 mismatch with num_devices 2 found here
BTRFS error (device sdd3): failed to read chunk tree: -22
BTRFS error (device sdd3): open_ctree failed
[CAUSE]
During btrfs device removal, chunk tree and super block num devs are
updated in two different transactions:
btrfs_rm_device()
|- btrfs_rm_dev_item(device)
| |- trans = btrfs_start_transaction()
| | Now we got transaction X
| |
| |- btrfs_del_item()
| | Now device item is removed from chunk tree
| |
| |- btrfs_commit_transaction()
| Transaction X got committed, super num devs untouched,
| but device item removed from chunk tree.
| (AKA, super num devs is already incorrect)
|
|- cur_devices->num_devices--;
|- cur_devices->total_devices--;
|- btrfs_set_super_num_devices()
All those operations are not in transaction X, thus they will
only be written back to disk in the next transaction.
So if a power loss happens after transaction X in btrfs_rm_dev_item()
has committed, but before transaction X+1 (which can be minutes away),
we get the super num devices mismatch.
[FIX]
Instead of starting and committing a transaction inside
btrfs_rm_dev_item(), start a transaction inside btrfs_rm_device() and
pass it to btrfs_rm_dev_item().
Only commit the transaction after everything is done.
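Roughly, the fixed flow looks like the sketch below; this is a simplified
illustration of the diff that follows, not the verbatim kernel code:

    /* Simplified sketch (illustrative only) of the fixed flow. */
    static int rm_device_sketch(struct btrfs_fs_info *fs_info,
                                struct btrfs_device *device)
    {
            struct btrfs_trans_handle *trans;
            int ret;

            /* One transaction now covers both the chunk tree and the
             * superblock device count. */
            trans = btrfs_start_transaction(fs_info->chunk_root, 0);
            if (IS_ERR(trans))
                    return PTR_ERR(trans);

            ret = btrfs_rm_dev_item(trans, device);   /* chunk tree update */
            if (ret) {
                    btrfs_abort_transaction(trans, ret);
                    btrfs_end_transaction(trans);
                    return ret;
            }

            /* num_devices/total_devices and btrfs_set_super_num_devices()
             * updates happen here, still inside the same transaction. */

            /* Single commit: either both updates reach disk or neither does. */
            return btrfs_commit_transaction(trans);
    }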
Reported-by: Luca Béla Palkovics <luca.bela.palkovics(a)gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CA+8xDSpvdm_U0QLBAnrH=zqDq_cWCOH5TiV46C…
CC: stable(a)vger.kernel.org # 4.14+
Reviewed-by: Anand Jain <anand.jain(a)oracle.com>
Signed-off-by: Qu Wenruo <wqu(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 1be7cb2f955f..2cfbc74a3b4e 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1896,23 +1896,18 @@ static void update_dev_time(const char *device_path)
path_put(&path);
}
-static int btrfs_rm_dev_item(struct btrfs_device *device)
+static int btrfs_rm_dev_item(struct btrfs_trans_handle *trans,
+ struct btrfs_device *device)
{
struct btrfs_root *root = device->fs_info->chunk_root;
int ret;
struct btrfs_path *path;
struct btrfs_key key;
- struct btrfs_trans_handle *trans;
path = btrfs_alloc_path();
if (!path)
return -ENOMEM;
- trans = btrfs_start_transaction(root, 0);
- if (IS_ERR(trans)) {
- btrfs_free_path(path);
- return PTR_ERR(trans);
- }
key.objectid = BTRFS_DEV_ITEMS_OBJECTID;
key.type = BTRFS_DEV_ITEM_KEY;
key.offset = device->devid;
@@ -1923,21 +1918,12 @@ static int btrfs_rm_dev_item(struct btrfs_device *device)
if (ret) {
if (ret > 0)
ret = -ENOENT;
- btrfs_abort_transaction(trans, ret);
- btrfs_end_transaction(trans);
goto out;
}
ret = btrfs_del_item(trans, root, path);
- if (ret) {
- btrfs_abort_transaction(trans, ret);
- btrfs_end_transaction(trans);
- }
-
out:
btrfs_free_path(path);
- if (!ret)
- ret = btrfs_commit_transaction(trans);
return ret;
}
@@ -2078,6 +2064,7 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info,
struct btrfs_dev_lookup_args *args,
struct block_device **bdev, fmode_t *mode)
{
+ struct btrfs_trans_handle *trans;
struct btrfs_device *device;
struct btrfs_fs_devices *cur_devices;
struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
@@ -2098,7 +2085,7 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info,
ret = btrfs_check_raid_min_devices(fs_info, num_devices - 1);
if (ret)
- goto out;
+ return ret;
device = btrfs_find_device(fs_info->fs_devices, args);
if (!device) {
@@ -2106,27 +2093,22 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info,
ret = BTRFS_ERROR_DEV_MISSING_NOT_FOUND;
else
ret = -ENOENT;
- goto out;
+ return ret;
}
if (btrfs_pinned_by_swapfile(fs_info, device)) {
btrfs_warn_in_rcu(fs_info,
"cannot remove device %s (devid %llu) due to active swapfile",
rcu_str_deref(device->name), device->devid);
- ret = -ETXTBSY;
- goto out;
+ return -ETXTBSY;
}
- if (test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state)) {
- ret = BTRFS_ERROR_DEV_TGT_REPLACE;
- goto out;
- }
+ if (test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state))
+ return BTRFS_ERROR_DEV_TGT_REPLACE;
if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state) &&
- fs_info->fs_devices->rw_devices == 1) {
- ret = BTRFS_ERROR_DEV_ONLY_WRITABLE;
- goto out;
- }
+ fs_info->fs_devices->rw_devices == 1)
+ return BTRFS_ERROR_DEV_ONLY_WRITABLE;
if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state)) {
mutex_lock(&fs_info->chunk_mutex);
@@ -2139,14 +2121,22 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info,
if (ret)
goto error_undo;
- /*
- * TODO: the superblock still includes this device in its num_devices
- * counter although write_all_supers() is not locked out. This
- * could give a filesystem state which requires a degraded mount.
- */
- ret = btrfs_rm_dev_item(device);
- if (ret)
+ trans = btrfs_start_transaction(fs_info->chunk_root, 0);
+ if (IS_ERR(trans)) {
+ ret = PTR_ERR(trans);
goto error_undo;
+ }
+
+ ret = btrfs_rm_dev_item(trans, device);
+ if (ret) {
+ /* Any error in dev item removal is critical */
+ btrfs_crit(fs_info,
+ "failed to remove device item for devid %llu: %d",
+ device->devid, ret);
+ btrfs_abort_transaction(trans, ret);
+ btrfs_end_transaction(trans);
+ return ret;
+ }
clear_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state);
btrfs_scrub_cancel_dev(device);
@@ -2229,7 +2219,8 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info,
free_fs_devices(cur_devices);
}
-out:
+ ret = btrfs_commit_transaction(trans);
+
return ret;
error_undo:
@@ -2240,7 +2231,7 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info,
device->fs_devices->rw_devices++;
mutex_unlock(&fs_info->chunk_mutex);
}
- goto out;
+ return ret;
}
void btrfs_rm_dev_replace_remove_srcdev(struct btrfs_device *srcdev)
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From bbac58698a55cc0a6f0c0d69a6dcd3f9f3134c11 Mon Sep 17 00:00:00 2001
From: Qu Wenruo <wqu(a)suse.com>
Date: Tue, 8 Mar 2022 13:36:38 +0800
Subject: [PATCH] btrfs: remove device item and update super block in the same
transaction
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
[BUG]
There is a report that a btrfs filesystem has a bad super block num
devices value. This makes btrfs reject the fs completely.
BTRFS error (device sdd3): super_num_devices 3 mismatch with num_devices 2 found here
BTRFS error (device sdd3): failed to read chunk tree: -22
BTRFS error (device sdd3): open_ctree failed
[CAUSE]
During btrfs device removal, chunk tree and super block num devs are
updated in two different transactions:
btrfs_rm_device()
|- btrfs_rm_dev_item(device)
| |- trans = btrfs_start_transaction()
| | Now we got transaction X
| |
| |- btrfs_del_item()
| | Now device item is removed from chunk tree
| |
| |- btrfs_commit_transaction()
| Transaction X got committed, super num devs untouched,
| but device item removed from chunk tree.
| (AKA, super num devs is already incorrect)
|
|- cur_devices->num_devices--;
|- cur_devices->total_devices--;
|- btrfs_set_super_num_devices()
All those operations are not in transaction X, thus they will
only be written back to disk in the next transaction.
So if a power loss happens after transaction X in btrfs_rm_dev_item()
has committed, but before transaction X+1 (which can be minutes away),
we get the super num devices mismatch.
[FIX]
Instead of starting and committing a transaction inside
btrfs_rm_dev_item(), start a transaction inside btrfs_rm_device() and
pass it to btrfs_rm_dev_item().
Only commit the transaction after everything is done.
Reported-by: Luca Béla Palkovics <luca.bela.palkovics(a)gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CA+8xDSpvdm_U0QLBAnrH=zqDq_cWCOH5TiV46C…
CC: stable(a)vger.kernel.org # 4.14+
Reviewed-by: Anand Jain <anand.jain(a)oracle.com>
Signed-off-by: Qu Wenruo <wqu(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 1be7cb2f955f..2cfbc74a3b4e 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1896,23 +1896,18 @@ static void update_dev_time(const char *device_path)
path_put(&path);
}
-static int btrfs_rm_dev_item(struct btrfs_device *device)
+static int btrfs_rm_dev_item(struct btrfs_trans_handle *trans,
+ struct btrfs_device *device)
{
struct btrfs_root *root = device->fs_info->chunk_root;
int ret;
struct btrfs_path *path;
struct btrfs_key key;
- struct btrfs_trans_handle *trans;
path = btrfs_alloc_path();
if (!path)
return -ENOMEM;
- trans = btrfs_start_transaction(root, 0);
- if (IS_ERR(trans)) {
- btrfs_free_path(path);
- return PTR_ERR(trans);
- }
key.objectid = BTRFS_DEV_ITEMS_OBJECTID;
key.type = BTRFS_DEV_ITEM_KEY;
key.offset = device->devid;
@@ -1923,21 +1918,12 @@ static int btrfs_rm_dev_item(struct btrfs_device *device)
if (ret) {
if (ret > 0)
ret = -ENOENT;
- btrfs_abort_transaction(trans, ret);
- btrfs_end_transaction(trans);
goto out;
}
ret = btrfs_del_item(trans, root, path);
- if (ret) {
- btrfs_abort_transaction(trans, ret);
- btrfs_end_transaction(trans);
- }
-
out:
btrfs_free_path(path);
- if (!ret)
- ret = btrfs_commit_transaction(trans);
return ret;
}
@@ -2078,6 +2064,7 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info,
struct btrfs_dev_lookup_args *args,
struct block_device **bdev, fmode_t *mode)
{
+ struct btrfs_trans_handle *trans;
struct btrfs_device *device;
struct btrfs_fs_devices *cur_devices;
struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
@@ -2098,7 +2085,7 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info,
ret = btrfs_check_raid_min_devices(fs_info, num_devices - 1);
if (ret)
- goto out;
+ return ret;
device = btrfs_find_device(fs_info->fs_devices, args);
if (!device) {
@@ -2106,27 +2093,22 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info,
ret = BTRFS_ERROR_DEV_MISSING_NOT_FOUND;
else
ret = -ENOENT;
- goto out;
+ return ret;
}
if (btrfs_pinned_by_swapfile(fs_info, device)) {
btrfs_warn_in_rcu(fs_info,
"cannot remove device %s (devid %llu) due to active swapfile",
rcu_str_deref(device->name), device->devid);
- ret = -ETXTBSY;
- goto out;
+ return -ETXTBSY;
}
- if (test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state)) {
- ret = BTRFS_ERROR_DEV_TGT_REPLACE;
- goto out;
- }
+ if (test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state))
+ return BTRFS_ERROR_DEV_TGT_REPLACE;
if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state) &&
- fs_info->fs_devices->rw_devices == 1) {
- ret = BTRFS_ERROR_DEV_ONLY_WRITABLE;
- goto out;
- }
+ fs_info->fs_devices->rw_devices == 1)
+ return BTRFS_ERROR_DEV_ONLY_WRITABLE;
if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state)) {
mutex_lock(&fs_info->chunk_mutex);
@@ -2139,14 +2121,22 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info,
if (ret)
goto error_undo;
- /*
- * TODO: the superblock still includes this device in its num_devices
- * counter although write_all_supers() is not locked out. This
- * could give a filesystem state which requires a degraded mount.
- */
- ret = btrfs_rm_dev_item(device);
- if (ret)
+ trans = btrfs_start_transaction(fs_info->chunk_root, 0);
+ if (IS_ERR(trans)) {
+ ret = PTR_ERR(trans);
goto error_undo;
+ }
+
+ ret = btrfs_rm_dev_item(trans, device);
+ if (ret) {
+ /* Any error in dev item removal is critical */
+ btrfs_crit(fs_info,
+ "failed to remove device item for devid %llu: %d",
+ device->devid, ret);
+ btrfs_abort_transaction(trans, ret);
+ btrfs_end_transaction(trans);
+ return ret;
+ }
clear_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state);
btrfs_scrub_cancel_dev(device);
@@ -2229,7 +2219,8 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info,
free_fs_devices(cur_devices);
}
-out:
+ ret = btrfs_commit_transaction(trans);
+
return ret;
error_undo:
@@ -2240,7 +2231,7 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info,
device->fs_devices->rw_devices++;
mutex_unlock(&fs_info->chunk_mutex);
}
- goto out;
+ return ret;
}
void btrfs_rm_dev_replace_remove_srcdev(struct btrfs_device *srcdev)
Hi,
Can you queue up this one for 5.10 stable? It has an extra hunk
that's needed for 5.10. 5.15/16/17 will be a direct port of
the 5.18-rc patch.
commit e677edbcabee849bfdd43f1602bccbecf736a646
Author: Jens Axboe <axboe(a)kernel.dk>
Date: Fri Apr 8 11:08:58 2022 -0600
io_uring: fix race between timeout flush and removal
commit e677edbcabee849bfdd43f1602bccbecf736a646 upstream.
io_flush_timeouts() assumes the timeout isn't in the process of triggering
or being removed/canceled, so it unconditionally removes it from the
timeout list and attempts to cancel it.
Leave it on the list and let the normal timeout cancelation take care
of it.
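The resulting loop is a standard deletion-safe list walk; a rough
kernel-style sketch of the pattern (illustrative only, names taken from
the patch below):

    #include <linux/list.h>

    /* Illustrative only: when the per-entry handler may remove an entry
     * from the list (or may leave it there, as after this fix), use the
     * _safe iterator, which caches the next node before the handler runs. */
    static void flush_timeouts_sketch(struct list_head *timeout_list)
    {
            struct io_kiocb *req, *tmp;

            list_for_each_entry_safe(req, tmp, timeout_list, timeout.list) {
                    /* io_kill_timeout() may or may not take req off the
                     * list; entries already being cancelled simply stay
                     * on it and are cleaned up by the normal path. */
                    io_kill_timeout(req, 0);
            }
    }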
Cc: stable(a)vger.kernel.org # 5.5+
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 5959b0359524..bc6bf17566f8 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1556,6 +1556,7 @@ static void __io_queue_deferred(struct io_ring_ctx *ctx)
static void io_flush_timeouts(struct io_ring_ctx *ctx)
{
+ struct io_kiocb *req, *tmp;
u32 seq;
if (list_empty(&ctx->timeout_list))
@@ -1563,10 +1564,8 @@ static void io_flush_timeouts(struct io_ring_ctx *ctx)
seq = ctx->cached_cq_tail - atomic_read(&ctx->cq_timeouts);
- do {
+ list_for_each_entry_safe(req, tmp, &ctx->timeout_list, timeout.list) {
u32 events_needed, events_got;
- struct io_kiocb *req = list_first_entry(&ctx->timeout_list,
- struct io_kiocb, timeout.list);
if (io_is_timeout_noseq(req))
break;
@@ -1583,9 +1582,8 @@ static void io_flush_timeouts(struct io_ring_ctx *ctx)
if (events_got < events_needed)
break;
- list_del_init(&req->timeout.list);
io_kill_timeout(req, 0);
- } while (!list_empty(&ctx->timeout_list));
+ }
ctx->cq_last_tm_flush = seq;
}
@@ -5639,6 +5637,7 @@ static int io_timeout_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe,
else
data->mode = HRTIMER_MODE_REL;
+ INIT_LIST_HEAD(&req->timeout.list);
hrtimer_init(&data->timer, CLOCK_MONOTONIC, data->mode);
return 0;
}
@@ -6282,12 +6281,12 @@ static enum hrtimer_restart io_link_timeout_fn(struct hrtimer *timer)
if (!list_empty(&req->link_list)) {
prev = list_entry(req->link_list.prev, struct io_kiocb,
link_list);
- if (refcount_inc_not_zero(&prev->refs))
- list_del_init(&req->link_list);
- else
+ list_del_init(&req->link_list);
+ if (!refcount_inc_not_zero(&prev->refs))
prev = NULL;
}
+ list_del(&req->timeout.list);
spin_unlock_irqrestore(&ctx->completion_lock, flags);
if (prev) {
--
Jens Axboe
The patch below does not apply to the 5.17-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 6bf9c47a398911e0ab920e362115153596c80432 Mon Sep 17 00:00:00 2001
From: Jens Axboe <axboe(a)kernel.dk>
Date: Tue, 29 Mar 2022 10:10:08 -0600
Subject: [PATCH] io_uring: defer file assignment
If an application uses direct open or accept, it knows in advance what
direct descriptor value it will get as it picks it itself. This allows
combined requests such as:
sqe = io_uring_get_sqe(ring);
io_uring_prep_openat_direct(sqe, ..., file_slot);
sqe->flags |= IOSQE_IO_LINK | IOSQE_CQE_SKIP_SUCCESS;
sqe = io_uring_get_sqe(ring);
io_uring_prep_read(sqe, file_slot, buf, buf_size, 0);
sqe->flags |= IOSQE_FIXED_FILE;
io_uring_submit(ring);
where we prepare both a file open and read, and only get a completion
event for the read when both have completed successfully.
Currently links are fully prepared before the head is issued, but that
fails if the dependent link needs a file assigned that isn't valid until
the head has completed.
Conversely, if the same chain is performed but the fixed file slot is
already valid, then we would be unexpectedly returning data from the
old file slot rather than the newly opened one. Make sure we're
consistent here.
Allow deferral of file setup, which makes this documented case work.
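A rough userspace illustration of that chain with liburing (a hypothetical
sketch: liburing 2.x assumed, file name and buffer size made up, error
handling omitted):

    #include <liburing.h>
    #include <fcntl.h>
    #include <stdio.h>

    int main(void)
    {
            struct io_uring ring;
            struct io_uring_sqe *sqe;
            struct io_uring_cqe *cqe;
            char buf[4096];
            int file_slot = 0;

            io_uring_queue_init(8, &ring, 0);
            /* Direct descriptors need a registered (sparse) file table. */
            io_uring_register_files_sparse(&ring, 4);

            sqe = io_uring_get_sqe(&ring);
            io_uring_prep_openat_direct(sqe, AT_FDCWD, "data.txt", O_RDONLY,
                                        0, file_slot);
            sqe->flags |= IOSQE_IO_LINK | IOSQE_CQE_SKIP_SUCCESS;

            sqe = io_uring_get_sqe(&ring);
            io_uring_prep_read(sqe, file_slot, buf, sizeof(buf), 0);
            sqe->flags |= IOSQE_FIXED_FILE;

            io_uring_submit(&ring);
            io_uring_wait_cqe(&ring, &cqe);   /* completion for the read only */
            printf("read returned %d\n", cqe->res);
            io_uring_cqe_seen(&ring, cqe);
            io_uring_queue_exit(&ring);
            return 0;
    }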
Cc: stable(a)vger.kernel.org # v5.15+
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/fs/io-wq.h b/fs/io-wq.h
index dbecd27656c7..04d374e65e54 100644
--- a/fs/io-wq.h
+++ b/fs/io-wq.h
@@ -155,6 +155,7 @@ struct io_wq_work_node *wq_stack_extract(struct io_wq_work_node *stack)
struct io_wq_work {
struct io_wq_work_node list;
unsigned flags;
+ int fd;
};
static inline struct io_wq_work *wq_next_work(struct io_wq_work *work)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 398128db9728..bdc090fec29c 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -7240,6 +7240,23 @@ static void io_clean_op(struct io_kiocb *req)
req->flags &= ~IO_REQ_CLEAN_FLAGS;
}
+static bool io_assign_file(struct io_kiocb *req, unsigned int issue_flags)
+{
+ if (req->file || !io_op_defs[req->opcode].needs_file)
+ return true;
+
+ if (req->flags & REQ_F_FIXED_FILE)
+ req->file = io_file_get_fixed(req, req->work.fd, issue_flags);
+ else
+ req->file = io_file_get_normal(req, req->work.fd);
+ if (req->file)
+ return true;
+
+ req_set_fail(req);
+ req->result = -EBADF;
+ return false;
+}
+
static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
{
const struct cred *creds = NULL;
@@ -7250,6 +7267,8 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
if (!io_op_defs[req->opcode].audit_skip)
audit_uring_entry(req->opcode);
+ if (unlikely(!io_assign_file(req, issue_flags)))
+ return -EBADF;
switch (req->opcode) {
case IORING_OP_NOP:
@@ -7394,10 +7413,11 @@ static struct io_wq_work *io_wq_free_work(struct io_wq_work *work)
static void io_wq_submit_work(struct io_wq_work *work)
{
struct io_kiocb *req = container_of(work, struct io_kiocb, work);
+ const struct io_op_def *def = &io_op_defs[req->opcode];
unsigned int issue_flags = IO_URING_F_UNLOCKED;
bool needs_poll = false;
struct io_kiocb *timeout;
- int ret = 0;
+ int ret = 0, err = -ECANCELED;
/* one will be dropped by ->io_free_work() after returning to io-wq */
if (!(req->flags & REQ_F_REFCOUNT))
@@ -7409,14 +7429,18 @@ static void io_wq_submit_work(struct io_wq_work *work)
if (timeout)
io_queue_linked_timeout(timeout);
+ if (!io_assign_file(req, issue_flags)) {
+ err = -EBADF;
+ work->flags |= IO_WQ_WORK_CANCEL;
+ }
+
/* either cancelled or io-wq is dying, so don't touch tctx->iowq */
if (work->flags & IO_WQ_WORK_CANCEL) {
- io_req_task_queue_fail(req, -ECANCELED);
+ io_req_task_queue_fail(req, err);
return;
}
if (req->flags & REQ_F_FORCE_ASYNC) {
- const struct io_op_def *def = &io_op_defs[req->opcode];
bool opcode_poll = def->pollin || def->pollout;
if (opcode_poll && file_can_poll(req->file)) {
@@ -7749,6 +7773,8 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
if (io_op_defs[opcode].needs_file) {
struct io_submit_state *state = &ctx->submit_state;
+ req->work.fd = READ_ONCE(sqe->fd);
+
/*
* Plug now if we have more than 2 IO left after this, and the
* target is potentially a read/write to block based storage.
@@ -7758,13 +7784,6 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
state->need_plug = false;
blk_start_plug_nr_ios(&state->plug, state->submit_nr);
}
-
- if (req->flags & REQ_F_FIXED_FILE)
- req->file = io_file_get_fixed(req, READ_ONCE(sqe->fd), 0);
- else
- req->file = io_file_get_normal(req, READ_ONCE(sqe->fd));
- if (unlikely(!req->file))
- return -EBADF;
}
personality = READ_ONCE(sqe->personality);
The patch below does not apply to the 5.16-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 6bf9c47a398911e0ab920e362115153596c80432 Mon Sep 17 00:00:00 2001
From: Jens Axboe <axboe(a)kernel.dk>
Date: Tue, 29 Mar 2022 10:10:08 -0600
Subject: [PATCH] io_uring: defer file assignment
If an application uses direct open or accept, it knows in advance what
direct descriptor value it will get as it picks it itself. This allows
combined requests such as:
sqe = io_uring_get_sqe(ring);
io_uring_prep_openat_direct(sqe, ..., file_slot);
sqe->flags |= IOSQE_IO_LINK | IOSQE_CQE_SKIP_SUCCESS;
sqe = io_uring_get_sqe(ring);
io_uring_prep_read(sqe, file_slot, buf, buf_size, 0);
sqe->flags |= IOSQE_FIXED_FILE;
io_uring_submit(ring);
where we prepare both a file open and read, and only get a completion
event for the read when both have completed successfully.
Currently links are fully prepared before the head is issued, but that
fails if the dependent link needs a file assigned that isn't valid until
the head has completed.
Conversely, if the same chain is performed but the fixed file slot is
already valid, then we would be unexpectedly returning data from the
old file slot rather than the newly opened one. Make sure we're
consistent here.
Allow deferral of file setup, which makes this documented case work.
Cc: stable(a)vger.kernel.org # v5.15+
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/fs/io-wq.h b/fs/io-wq.h
index dbecd27656c7..04d374e65e54 100644
--- a/fs/io-wq.h
+++ b/fs/io-wq.h
@@ -155,6 +155,7 @@ struct io_wq_work_node *wq_stack_extract(struct io_wq_work_node *stack)
struct io_wq_work {
struct io_wq_work_node list;
unsigned flags;
+ int fd;
};
static inline struct io_wq_work *wq_next_work(struct io_wq_work *work)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 398128db9728..bdc090fec29c 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -7240,6 +7240,23 @@ static void io_clean_op(struct io_kiocb *req)
req->flags &= ~IO_REQ_CLEAN_FLAGS;
}
+static bool io_assign_file(struct io_kiocb *req, unsigned int issue_flags)
+{
+ if (req->file || !io_op_defs[req->opcode].needs_file)
+ return true;
+
+ if (req->flags & REQ_F_FIXED_FILE)
+ req->file = io_file_get_fixed(req, req->work.fd, issue_flags);
+ else
+ req->file = io_file_get_normal(req, req->work.fd);
+ if (req->file)
+ return true;
+
+ req_set_fail(req);
+ req->result = -EBADF;
+ return false;
+}
+
static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
{
const struct cred *creds = NULL;
@@ -7250,6 +7267,8 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
if (!io_op_defs[req->opcode].audit_skip)
audit_uring_entry(req->opcode);
+ if (unlikely(!io_assign_file(req, issue_flags)))
+ return -EBADF;
switch (req->opcode) {
case IORING_OP_NOP:
@@ -7394,10 +7413,11 @@ static struct io_wq_work *io_wq_free_work(struct io_wq_work *work)
static void io_wq_submit_work(struct io_wq_work *work)
{
struct io_kiocb *req = container_of(work, struct io_kiocb, work);
+ const struct io_op_def *def = &io_op_defs[req->opcode];
unsigned int issue_flags = IO_URING_F_UNLOCKED;
bool needs_poll = false;
struct io_kiocb *timeout;
- int ret = 0;
+ int ret = 0, err = -ECANCELED;
/* one will be dropped by ->io_free_work() after returning to io-wq */
if (!(req->flags & REQ_F_REFCOUNT))
@@ -7409,14 +7429,18 @@ static void io_wq_submit_work(struct io_wq_work *work)
if (timeout)
io_queue_linked_timeout(timeout);
+ if (!io_assign_file(req, issue_flags)) {
+ err = -EBADF;
+ work->flags |= IO_WQ_WORK_CANCEL;
+ }
+
/* either cancelled or io-wq is dying, so don't touch tctx->iowq */
if (work->flags & IO_WQ_WORK_CANCEL) {
- io_req_task_queue_fail(req, -ECANCELED);
+ io_req_task_queue_fail(req, err);
return;
}
if (req->flags & REQ_F_FORCE_ASYNC) {
- const struct io_op_def *def = &io_op_defs[req->opcode];
bool opcode_poll = def->pollin || def->pollout;
if (opcode_poll && file_can_poll(req->file)) {
@@ -7749,6 +7773,8 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
if (io_op_defs[opcode].needs_file) {
struct io_submit_state *state = &ctx->submit_state;
+ req->work.fd = READ_ONCE(sqe->fd);
+
/*
* Plug now if we have more than 2 IO left after this, and the
* target is potentially a read/write to block based storage.
@@ -7758,13 +7784,6 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
state->need_plug = false;
blk_start_plug_nr_ios(&state->plug, state->submit_nr);
}
-
- if (req->flags & REQ_F_FIXED_FILE)
- req->file = io_file_get_fixed(req, READ_ONCE(sqe->fd), 0);
- else
- req->file = io_file_get_normal(req, READ_ONCE(sqe->fd));
- if (unlikely(!req->file))
- return -EBADF;
}
personality = READ_ONCE(sqe->personality);
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 6bf9c47a398911e0ab920e362115153596c80432 Mon Sep 17 00:00:00 2001
From: Jens Axboe <axboe(a)kernel.dk>
Date: Tue, 29 Mar 2022 10:10:08 -0600
Subject: [PATCH] io_uring: defer file assignment
If an application uses direct open or accept, it knows in advance what
direct descriptor value it will get as it picks it itself. This allows
combined requests such as:
sqe = io_uring_get_sqe(ring);
io_uring_prep_openat_direct(sqe, ..., file_slot);
sqe->flags |= IOSQE_IO_LINK | IOSQE_CQE_SKIP_SUCCESS;
sqe = io_uring_get_sqe(ring);
io_uring_prep_read(sqe, file_slot, buf, buf_size, 0);
sqe->flags |= IOSQE_FIXED_FILE;
io_uring_submit(ring);
where we prepare both a file open and read, and only get a completion
event for the read when both have completed successfully.
Currently links are fully prepared before the head is issued, but that
fails if the dependent link needs a file assigned that isn't valid until
the head has completed.
Conversely, if the same chain is performed but the fixed file slot is
already valid, then we would be unexpectedly returning data from the
old file slot rather than the newly opened one. Make sure we're
consistent here.
Allow deferral of file setup, which makes this documented case work.
Cc: stable(a)vger.kernel.org # v5.15+
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/fs/io-wq.h b/fs/io-wq.h
index dbecd27656c7..04d374e65e54 100644
--- a/fs/io-wq.h
+++ b/fs/io-wq.h
@@ -155,6 +155,7 @@ struct io_wq_work_node *wq_stack_extract(struct io_wq_work_node *stack)
struct io_wq_work {
struct io_wq_work_node list;
unsigned flags;
+ int fd;
};
static inline struct io_wq_work *wq_next_work(struct io_wq_work *work)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 398128db9728..bdc090fec29c 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -7240,6 +7240,23 @@ static void io_clean_op(struct io_kiocb *req)
req->flags &= ~IO_REQ_CLEAN_FLAGS;
}
+static bool io_assign_file(struct io_kiocb *req, unsigned int issue_flags)
+{
+ if (req->file || !io_op_defs[req->opcode].needs_file)
+ return true;
+
+ if (req->flags & REQ_F_FIXED_FILE)
+ req->file = io_file_get_fixed(req, req->work.fd, issue_flags);
+ else
+ req->file = io_file_get_normal(req, req->work.fd);
+ if (req->file)
+ return true;
+
+ req_set_fail(req);
+ req->result = -EBADF;
+ return false;
+}
+
static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
{
const struct cred *creds = NULL;
@@ -7250,6 +7267,8 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
if (!io_op_defs[req->opcode].audit_skip)
audit_uring_entry(req->opcode);
+ if (unlikely(!io_assign_file(req, issue_flags)))
+ return -EBADF;
switch (req->opcode) {
case IORING_OP_NOP:
@@ -7394,10 +7413,11 @@ static struct io_wq_work *io_wq_free_work(struct io_wq_work *work)
static void io_wq_submit_work(struct io_wq_work *work)
{
struct io_kiocb *req = container_of(work, struct io_kiocb, work);
+ const struct io_op_def *def = &io_op_defs[req->opcode];
unsigned int issue_flags = IO_URING_F_UNLOCKED;
bool needs_poll = false;
struct io_kiocb *timeout;
- int ret = 0;
+ int ret = 0, err = -ECANCELED;
/* one will be dropped by ->io_free_work() after returning to io-wq */
if (!(req->flags & REQ_F_REFCOUNT))
@@ -7409,14 +7429,18 @@ static void io_wq_submit_work(struct io_wq_work *work)
if (timeout)
io_queue_linked_timeout(timeout);
+ if (!io_assign_file(req, issue_flags)) {
+ err = -EBADF;
+ work->flags |= IO_WQ_WORK_CANCEL;
+ }
+
/* either cancelled or io-wq is dying, so don't touch tctx->iowq */
if (work->flags & IO_WQ_WORK_CANCEL) {
- io_req_task_queue_fail(req, -ECANCELED);
+ io_req_task_queue_fail(req, err);
return;
}
if (req->flags & REQ_F_FORCE_ASYNC) {
- const struct io_op_def *def = &io_op_defs[req->opcode];
bool opcode_poll = def->pollin || def->pollout;
if (opcode_poll && file_can_poll(req->file)) {
@@ -7749,6 +7773,8 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
if (io_op_defs[opcode].needs_file) {
struct io_submit_state *state = &ctx->submit_state;
+ req->work.fd = READ_ONCE(sqe->fd);
+
/*
* Plug now if we have more than 2 IO left after this, and the
* target is potentially a read/write to block based storage.
@@ -7758,13 +7784,6 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
state->need_plug = false;
blk_start_plug_nr_ios(&state->plug, state->submit_nr);
}
-
- if (req->flags & REQ_F_FIXED_FILE)
- req->file = io_file_get_fixed(req, READ_ONCE(sqe->fd), 0);
- else
- req->file = io_file_get_normal(req, READ_ONCE(sqe->fd));
- if (unlikely(!req->file))
- return -EBADF;
}
personality = READ_ONCE(sqe->personality);
[BUG]
When running generic/475 with 64K page size and 4K sector size, it has a
very high chance (almost 100%) of hanging, usually with a data page left
locked and no one going to unlock it.
[CAUSE]
With commit 1784b7d502a9 ("btrfs: handle csum lookup errors properly on
reads"), if we fail to look up a checksum due to a metadata IO error, we
will return an error from btrfs_submit_data_bio().
This will cause the page to be unlocked twice in btrfs_do_readpage():
btrfs_do_readpage()
|- submit_extent_page()
| |- submit_one_bio()
| |- btrfs_submit_data_bio()
| |- if (ret) {
| |- bio->bi_status = ret;
| |- bio_endio(bio); }
| In the endio function, we will call end_page_read()
| and unlock_extent() to cleanup the subpage range.
|
|- if (ret) {
|- unlock_extent(); end_page_read() }
Here we unlock the extent and cleanup the subpage range
again.
For unlock_extent(), a double unlock is mostly safe.
But for end_page_read() it is not, especially in the subpage case,
where we call btrfs_subpage_end_reader() to reduce the reader count
and use that count to determine whether we need to unlock the full
page.
If accounted twice, the count can underflow and leave the page
locked without anyone to unlock it.
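A minimal sketch of why that double accounting is fatal (illustrative
only, not the actual btrfs_subpage_end_reader() code):

    /* Illustrative only: the last reader is supposed to unlock the page.
     * If the same range is accounted twice, the counter underflows and
     * the zero-crossing that should unlock the page never happens for
     * the real last reader. */
    static void end_reader_sketch(atomic_t *readers, struct page *page, u32 len)
    {
            if (atomic_sub_and_test(len, readers))
                    unlock_page(page);      /* only the final reader unlocks */
    }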
[FIX]
The commit 1784b7d502a9 ("btrfs: handle csum lookup errors properly on
reads") itself is completely fine; it is our existing code that does not
properly handle the error from the bio submission hook.
Since one and only one of the endio function or the submit_extent_page()
caller should do the cleanup, let's just ignore the return value from
the bio submission hook.
Before the hook, it is the caller's responsibility to do the cleanup;
after the hook (including inside the hook), the endio function does the
cleanup.
This patch makes submit_one_bio() always return 0, but still keeps the
old int return value to keep the change minimal for backporting.
CC: stable(a)vger.kernel.org # 5.18+
Signed-off-by: Qu Wenruo <wqu(a)suse.com>
---
fs/btrfs/extent_io.c | 18 +++++++++++-------
1 file changed, 11 insertions(+), 7 deletions(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 53b59944013f..34073b0ed6ca 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -165,24 +165,28 @@ static int add_extent_changeset(struct extent_state *state, u32 bits,
return ret;
}
-int __must_check submit_one_bio(struct bio *bio, int mirror_num,
- unsigned long bio_flags)
+int submit_one_bio(struct bio *bio, int mirror_num, unsigned long bio_flags)
{
- blk_status_t ret = 0;
struct extent_io_tree *tree = bio->bi_private;
bio->bi_private = NULL;
/* Caller should ensure the bio has at least some range added */
ASSERT(bio->bi_iter.bi_size);
+
if (is_data_inode(tree->private_data))
- ret = btrfs_submit_data_bio(tree->private_data, bio, mirror_num,
+ btrfs_submit_data_bio(tree->private_data, bio, mirror_num,
bio_flags);
else
- ret = btrfs_submit_metadata_bio(tree->private_data, bio,
+ btrfs_submit_metadata_bio(tree->private_data, bio,
mirror_num, bio_flags);
-
- return blk_status_to_errno(ret);
+ /*
+ * Above submission hooks will handle the error by ending the bio,
+ * which will do the cleanup properly.
+ * So here we should not return any error, or the caller of
+ * submit_extent_page() will do cleanup again, causing problems.
+ */
+ return 0;
}
/* Cleanup unsubmitted bios */
--
2.35.1
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 5d435933376962b107bd76970912e7e80247dcc7 Mon Sep 17 00:00:00 2001
From: Christian Löhle <CLoehle(a)hyperstone.com>
Date: Thu, 24 Mar 2022 14:18:41 +0000
Subject: [PATCH] mmc: block: Check for errors after write on SPI
Introduce a SEND_STATUS check for writes through SPI so that an
unsuccessful write is not marked as successful.
Since SPI SD/MMC does not have states, after a write the card will
just hold the line LOW until it is ready again. The driver therefore
marks the write as completed as soon as it reads something other than
all zeroes.
The driver does not distinguish between a card that is no longer
signalling busy and one that has been disconnected (with the line
being pulled up by the host).
This led to writes being marked as successful when disconnecting
a busy card.
Now an additional CMD13 ensures the card is still connected, just as
the non-SPI path ensures the card goes back to the TRAN state.
While at it, and since we already poll for the post-write status anyway,
we might as well check for SPI's error bits (any of them).
The disconnecting-card problem is reproducible for me after continuous
write activity and random disconnects, around every 20-50 tries
on SPI DS for some cards.
Fixes: 7213d175e3b6f ("MMC/SD card driver learns SPI")
Cc: stable(a)vger.kernel.org
Signed-off-by: Christian Loehle <cloehle(a)hyperstone.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko(a)linux.intel.com>
Link: https://lore.kernel.org/r/76f6f5d2b35543bab3dfe438f268609c@hyperstone.com
Signed-off-by: Ulf Hansson <ulf.hansson(a)linaro.org>
diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c
index 4e67c1403cc9..be2078684417 100644
--- a/drivers/mmc/core/block.c
+++ b/drivers/mmc/core/block.c
@@ -1880,6 +1880,31 @@ static inline bool mmc_blk_rq_error(struct mmc_blk_request *brq)
brq->data.error || brq->cmd.resp[0] & CMD_ERRORS;
}
+static int mmc_spi_err_check(struct mmc_card *card)
+{
+ u32 status = 0;
+ int err;
+
+ /*
+ * SPI does not have a TRAN state we have to wait on, instead the
+ * card is ready again when it no longer holds the line LOW.
+ * We still have to ensure two things here before we know the write
+ * was successful:
+ * 1. The card has not disconnected during busy and we actually read our
+ * own pull-up, thinking it was still connected, so ensure it
+ * still responds.
+ * 2. Check for any error bits, in particular R1_SPI_IDLE to catch a
+ * just reconnected card after being disconnected during busy.
+ */
+ err = __mmc_send_status(card, &status, 0);
+ if (err)
+ return err;
+ /* All R1 and R2 bits of SPI are errors in our case */
+ if (status)
+ return -EIO;
+ return 0;
+}
+
static int mmc_blk_busy_cb(void *cb_data, bool *busy)
{
struct mmc_blk_busy_data *data = cb_data;
@@ -1903,9 +1928,16 @@ static int mmc_blk_card_busy(struct mmc_card *card, struct request *req)
struct mmc_blk_busy_data cb_data;
int err;
- if (mmc_host_is_spi(card->host) || rq_data_dir(req) == READ)
+ if (rq_data_dir(req) == READ)
return 0;
+ if (mmc_host_is_spi(card->host)) {
+ err = mmc_spi_err_check(card);
+ if (err)
+ mqrq->brq.data.bytes_xfered = 0;
+ return err;
+ }
+
cb_data.card = card;
cb_data.status = 0;
err = __mmc_poll_for_busy(card->host, 0, MMC_BLK_TIMEOUT_MS,
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 5d435933376962b107bd76970912e7e80247dcc7 Mon Sep 17 00:00:00 2001
From: Christian Löhle <CLoehle(a)hyperstone.com>
Date: Thu, 24 Mar 2022 14:18:41 +0000
Subject: [PATCH] mmc: block: Check for errors after write on SPI
Introduce a SEND_STATUS check for writes through SPI so that an
unsuccessful write is not marked as successful.
Since SPI SD/MMC does not have states, after a write the card will
just hold the line LOW until it is ready again. The driver therefore
marks the write as completed as soon as it reads something other than
all zeroes.
The driver does not distinguish between a card that is no longer
signalling busy and one that has been disconnected (with the line
being pulled up by the host).
This led to writes being marked as successful when disconnecting
a busy card.
Now an additional CMD13 ensures the card is still connected, just as
the non-SPI path ensures the card goes back to the TRAN state.
While at it, and since we already poll for the post-write status anyway,
we might as well check for SPI's error bits (any of them).
The disconnecting-card problem is reproducible for me after continuous
write activity and random disconnects, around every 20-50 tries
on SPI DS for some cards.
Fixes: 7213d175e3b6f ("MMC/SD card driver learns SPI")
Cc: stable(a)vger.kernel.org
Signed-off-by: Christian Loehle <cloehle(a)hyperstone.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko(a)linux.intel.com>
Link: https://lore.kernel.org/r/76f6f5d2b35543bab3dfe438f268609c@hyperstone.com
Signed-off-by: Ulf Hansson <ulf.hansson(a)linaro.org>
diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c
index 4e67c1403cc9..be2078684417 100644
--- a/drivers/mmc/core/block.c
+++ b/drivers/mmc/core/block.c
@@ -1880,6 +1880,31 @@ static inline bool mmc_blk_rq_error(struct mmc_blk_request *brq)
brq->data.error || brq->cmd.resp[0] & CMD_ERRORS;
}
+static int mmc_spi_err_check(struct mmc_card *card)
+{
+ u32 status = 0;
+ int err;
+
+ /*
+ * SPI does not have a TRAN state we have to wait on, instead the
+ * card is ready again when it no longer holds the line LOW.
+ * We still have to ensure two things here before we know the write
+ * was successful:
+ * 1. The card has not disconnected during busy and we actually read our
+ * own pull-up, thinking it was still connected, so ensure it
+ * still responds.
+ * 2. Check for any error bits, in particular R1_SPI_IDLE to catch a
+ * just reconnected card after being disconnected during busy.
+ */
+ err = __mmc_send_status(card, &status, 0);
+ if (err)
+ return err;
+ /* All R1 and R2 bits of SPI are errors in our case */
+ if (status)
+ return -EIO;
+ return 0;
+}
+
static int mmc_blk_busy_cb(void *cb_data, bool *busy)
{
struct mmc_blk_busy_data *data = cb_data;
@@ -1903,9 +1928,16 @@ static int mmc_blk_card_busy(struct mmc_card *card, struct request *req)
struct mmc_blk_busy_data cb_data;
int err;
- if (mmc_host_is_spi(card->host) || rq_data_dir(req) == READ)
+ if (rq_data_dir(req) == READ)
return 0;
+ if (mmc_host_is_spi(card->host)) {
+ err = mmc_spi_err_check(card);
+ if (err)
+ mqrq->brq.data.bytes_xfered = 0;
+ return err;
+ }
+
cb_data.card = card;
cb_data.status = 0;
err = __mmc_poll_for_busy(card->host, 0, MMC_BLK_TIMEOUT_MS,
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 5d435933376962b107bd76970912e7e80247dcc7 Mon Sep 17 00:00:00 2001
From: Christian Löhle <CLoehle(a)hyperstone.com>
Date: Thu, 24 Mar 2022 14:18:41 +0000
Subject: [PATCH] mmc: block: Check for errors after write on SPI
Introduce a SEND_STATUS check for writes through SPI so that an
unsuccessful write is not marked as successful.
Since SPI SD/MMC does not have states, after a write the card will
just hold the line LOW until it is ready again. The driver therefore
marks the write as completed as soon as it reads something other than
all zeroes.
The driver does not distinguish between a card that is no longer
signalling busy and one that has been disconnected (with the line
being pulled up by the host).
This led to writes being marked as successful when disconnecting
a busy card.
Now an additional CMD13 ensures the card is still connected, just as
the non-SPI path ensures the card goes back to the TRAN state.
While at it, and since we already poll for the post-write status anyway,
we might as well check for SPI's error bits (any of them).
The disconnecting-card problem is reproducible for me after continuous
write activity and random disconnects, around every 20-50 tries
on SPI DS for some cards.
Fixes: 7213d175e3b6f ("MMC/SD card driver learns SPI")
Cc: stable(a)vger.kernel.org
Signed-off-by: Christian Loehle <cloehle(a)hyperstone.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko(a)linux.intel.com>
Link: https://lore.kernel.org/r/76f6f5d2b35543bab3dfe438f268609c@hyperstone.com
Signed-off-by: Ulf Hansson <ulf.hansson(a)linaro.org>
diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c
index 4e67c1403cc9..be2078684417 100644
--- a/drivers/mmc/core/block.c
+++ b/drivers/mmc/core/block.c
@@ -1880,6 +1880,31 @@ static inline bool mmc_blk_rq_error(struct mmc_blk_request *brq)
brq->data.error || brq->cmd.resp[0] & CMD_ERRORS;
}
+static int mmc_spi_err_check(struct mmc_card *card)
+{
+ u32 status = 0;
+ int err;
+
+ /*
+ * SPI does not have a TRAN state we have to wait on, instead the
+ * card is ready again when it no longer holds the line LOW.
+ * We still have to ensure two things here before we know the write
+ * was successful:
+ * 1. The card has not disconnected during busy and we actually read our
+ * own pull-up, thinking it was still connected, so ensure it
+ * still responds.
+ * 2. Check for any error bits, in particular R1_SPI_IDLE to catch a
+ * just reconnected card after being disconnected during busy.
+ */
+ err = __mmc_send_status(card, &status, 0);
+ if (err)
+ return err;
+ /* All R1 and R2 bits of SPI are errors in our case */
+ if (status)
+ return -EIO;
+ return 0;
+}
+
static int mmc_blk_busy_cb(void *cb_data, bool *busy)
{
struct mmc_blk_busy_data *data = cb_data;
@@ -1903,9 +1928,16 @@ static int mmc_blk_card_busy(struct mmc_card *card, struct request *req)
struct mmc_blk_busy_data cb_data;
int err;
- if (mmc_host_is_spi(card->host) || rq_data_dir(req) == READ)
+ if (rq_data_dir(req) == READ)
return 0;
+ if (mmc_host_is_spi(card->host)) {
+ err = mmc_spi_err_check(card);
+ if (err)
+ mqrq->brq.data.bytes_xfered = 0;
+ return err;
+ }
+
cb_data.card = card;
cb_data.status = 0;
err = __mmc_poll_for_busy(card->host, 0, MMC_BLK_TIMEOUT_MS,
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 5d435933376962b107bd76970912e7e80247dcc7 Mon Sep 17 00:00:00 2001
From: Christian Löhle <CLoehle(a)hyperstone.com>
Date: Thu, 24 Mar 2022 14:18:41 +0000
Subject: [PATCH] mmc: block: Check for errors after write on SPI
Introduce a SEND_STATUS check for writes through SPI so that an
unsuccessful write is not marked as successful.
Since SPI SD/MMC does not have states, after a write the card will
just hold the line LOW until it is ready again. The driver therefore
marks the write as completed as soon as it reads something other than
all zeroes.
The driver does not distinguish between a card that is no longer
signalling busy and one that has been disconnected (with the line
being pulled up by the host).
This led to writes being marked as successful when disconnecting
a busy card.
Now an additional CMD13 ensures the card is still connected, just as
the non-SPI path ensures the card goes back to the TRAN state.
While at it, and since we already poll for the post-write status anyway,
we might as well check for SPI's error bits (any of them).
The disconnecting-card problem is reproducible for me after continuous
write activity and random disconnects, around every 20-50 tries
on SPI DS for some cards.
Fixes: 7213d175e3b6f ("MMC/SD card driver learns SPI")
Cc: stable(a)vger.kernel.org
Signed-off-by: Christian Loehle <cloehle(a)hyperstone.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko(a)linux.intel.com>
Link: https://lore.kernel.org/r/76f6f5d2b35543bab3dfe438f268609c@hyperstone.com
Signed-off-by: Ulf Hansson <ulf.hansson(a)linaro.org>
diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c
index 4e67c1403cc9..be2078684417 100644
--- a/drivers/mmc/core/block.c
+++ b/drivers/mmc/core/block.c
@@ -1880,6 +1880,31 @@ static inline bool mmc_blk_rq_error(struct mmc_blk_request *brq)
brq->data.error || brq->cmd.resp[0] & CMD_ERRORS;
}
+static int mmc_spi_err_check(struct mmc_card *card)
+{
+ u32 status = 0;
+ int err;
+
+ /*
+ * SPI does not have a TRAN state we have to wait on, instead the
+ * card is ready again when it no longer holds the line LOW.
+ * We still have to ensure two things here before we know the write
+ * was successful:
+ * 1. The card has not disconnected during busy and we actually read our
+ * own pull-up, thinking it was still connected, so ensure it
+ * still responds.
+ * 2. Check for any error bits, in particular R1_SPI_IDLE to catch a
+ * just reconnected card after being disconnected during busy.
+ */
+ err = __mmc_send_status(card, &status, 0);
+ if (err)
+ return err;
+ /* All R1 and R2 bits of SPI are errors in our case */
+ if (status)
+ return -EIO;
+ return 0;
+}
+
static int mmc_blk_busy_cb(void *cb_data, bool *busy)
{
struct mmc_blk_busy_data *data = cb_data;
@@ -1903,9 +1928,16 @@ static int mmc_blk_card_busy(struct mmc_card *card, struct request *req)
struct mmc_blk_busy_data cb_data;
int err;
- if (mmc_host_is_spi(card->host) || rq_data_dir(req) == READ)
+ if (rq_data_dir(req) == READ)
return 0;
+ if (mmc_host_is_spi(card->host)) {
+ err = mmc_spi_err_check(card);
+ if (err)
+ mqrq->brq.data.bytes_xfered = 0;
+ return err;
+ }
+
cb_data.card = card;
cb_data.status = 0;
err = __mmc_poll_for_busy(card->host, 0, MMC_BLK_TIMEOUT_MS,
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 5d435933376962b107bd76970912e7e80247dcc7 Mon Sep 17 00:00:00 2001
From: Christian Löhle <CLoehle(a)hyperstone.com>
Date: Thu, 24 Mar 2022 14:18:41 +0000
Subject: [PATCH] mmc: block: Check for errors after write on SPI
Introduce a SEND_STATUS check for writes through SPI so that an
unsuccessful write is not marked as successful.
Since SPI SD/MMC does not have states, after a write the card will
just hold the line LOW until it is ready again. The driver therefore
marks the write as completed as soon as it reads something other than
all zeroes.
The driver does not distinguish between a card that is no longer
signalling busy and one that has been disconnected (with the line
being pulled up by the host).
This led to writes being marked as successful when disconnecting
a busy card.
Now an additional CMD13 ensures the card is still connected, just as
the non-SPI path ensures the card goes back to the TRAN state.
While at it, and since we already poll for the post-write status anyway,
we might as well check for SPI's error bits (any of them).
The disconnecting-card problem is reproducible for me after continuous
write activity and random disconnects, around every 20-50 tries
on SPI DS for some cards.
Fixes: 7213d175e3b6f ("MMC/SD card driver learns SPI")
Cc: stable(a)vger.kernel.org
Signed-off-by: Christian Loehle <cloehle(a)hyperstone.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko(a)linux.intel.com>
Link: https://lore.kernel.org/r/76f6f5d2b35543bab3dfe438f268609c@hyperstone.com
Signed-off-by: Ulf Hansson <ulf.hansson(a)linaro.org>
diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c
index 4e67c1403cc9..be2078684417 100644
--- a/drivers/mmc/core/block.c
+++ b/drivers/mmc/core/block.c
@@ -1880,6 +1880,31 @@ static inline bool mmc_blk_rq_error(struct mmc_blk_request *brq)
brq->data.error || brq->cmd.resp[0] & CMD_ERRORS;
}
+static int mmc_spi_err_check(struct mmc_card *card)
+{
+ u32 status = 0;
+ int err;
+
+ /*
+ * SPI does not have a TRAN state we have to wait on, instead the
+ * card is ready again when it no longer holds the line LOW.
+ * We still have to ensure two things here before we know the write
+ * was successful:
+ * 1. The card has not disconnected during busy and we actually read our
+ * own pull-up, thinking it was still connected, so ensure it
+ * still responds.
+ * 2. Check for any error bits, in particular R1_SPI_IDLE to catch a
+ * just reconnected card after being disconnected during busy.
+ */
+ err = __mmc_send_status(card, &status, 0);
+ if (err)
+ return err;
+ /* All R1 and R2 bits of SPI are errors in our case */
+ if (status)
+ return -EIO;
+ return 0;
+}
+
static int mmc_blk_busy_cb(void *cb_data, bool *busy)
{
struct mmc_blk_busy_data *data = cb_data;
@@ -1903,9 +1928,16 @@ static int mmc_blk_card_busy(struct mmc_card *card, struct request *req)
struct mmc_blk_busy_data cb_data;
int err;
- if (mmc_host_is_spi(card->host) || rq_data_dir(req) == READ)
+ if (rq_data_dir(req) == READ)
return 0;
+ if (mmc_host_is_spi(card->host)) {
+ err = mmc_spi_err_check(card);
+ if (err)
+ mqrq->brq.data.bytes_xfered = 0;
+ return err;
+ }
+
cb_data.card = card;
cb_data.status = 0;
err = __mmc_poll_for_busy(card->host, 0, MMC_BLK_TIMEOUT_MS,
The patch below does not apply to the 5.16-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 7294a9bcaa7ee0d3b96aab1a277317315fd46f09 Mon Sep 17 00:00:00 2001
From: James Smart <jsmart2021(a)gmail.com>
Date: Wed, 23 Mar 2022 13:55:44 -0700
Subject: [PATCH] scsi: lpfc: Fix broken SLI4 abort path
There was a merge error in the 14.2.0.0 patches that resulted in the SLI4
path using the SLI3 issue_abort_iotag() routine. This resulted in txcmplq
corruption.
Fix this by using the SLI4 routine when running SLI4.
Link: https://lore.kernel.org/r/20220323205545.81814-2-jsmart2021@gmail.com
Fixes: 31a59f75702f ("scsi: lpfc: SLI path split: Refactor Abort paths")
Cc: <stable(a)vger.kernel.org> # v5.2+
Co-developed-by: Dick Kennedy <dick.kennedy(a)broadcom.com>
Signed-off-by: Dick Kennedy <dick.kennedy(a)broadcom.com>
Signed-off-by: James Smart <jsmart2021(a)gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen(a)oracle.com>
diff --git a/drivers/scsi/lpfc/lpfc_scsi.c b/drivers/scsi/lpfc/lpfc_scsi.c
index 3c132604fd91..ba9dbb51b75f 100644
--- a/drivers/scsi/lpfc/lpfc_scsi.c
+++ b/drivers/scsi/lpfc/lpfc_scsi.c
@@ -5929,13 +5929,15 @@ lpfc_abort_handler(struct scsi_cmnd *cmnd)
}
lpfc_cmd->waitq = &waitq;
- if (phba->sli_rev == LPFC_SLI_REV4)
+ if (phba->sli_rev == LPFC_SLI_REV4) {
spin_unlock(&pring_s4->ring_lock);
- else
+ ret_val = lpfc_sli4_issue_abort_iotag(phba, iocb,
+ lpfc_sli_abort_fcp_cmpl);
+ } else {
pring = &phba->sli.sli3_ring[LPFC_FCP_RING];
-
- ret_val = lpfc_sli_issue_abort_iotag(phba, pring, iocb,
- lpfc_sli_abort_fcp_cmpl);
+ ret_val = lpfc_sli_issue_abort_iotag(phba, pring, iocb,
+ lpfc_sli_abort_fcp_cmpl);
+ }
/* Make sure HBA is alive */
lpfc_issue_hb_tmo(phba);
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 7294a9bcaa7ee0d3b96aab1a277317315fd46f09 Mon Sep 17 00:00:00 2001
From: James Smart <jsmart2021(a)gmail.com>
Date: Wed, 23 Mar 2022 13:55:44 -0700
Subject: [PATCH] scsi: lpfc: Fix broken SLI4 abort path
There was a merge error in the 14.2.0.0 patches that resulted in the SLI4
path using the SLI3 issue_abort_iotag() routine. This resulted in txcmplq
corruption.
Fix this by using the SLI4 routine when running SLI4.
Link: https://lore.kernel.org/r/20220323205545.81814-2-jsmart2021@gmail.com
Fixes: 31a59f75702f ("scsi: lpfc: SLI path split: Refactor Abort paths")
Cc: <stable(a)vger.kernel.org> # v5.2+
Co-developed-by: Dick Kennedy <dick.kennedy(a)broadcom.com>
Signed-off-by: Dick Kennedy <dick.kennedy(a)broadcom.com>
Signed-off-by: James Smart <jsmart2021(a)gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen(a)oracle.com>
diff --git a/drivers/scsi/lpfc/lpfc_scsi.c b/drivers/scsi/lpfc/lpfc_scsi.c
index 3c132604fd91..ba9dbb51b75f 100644
--- a/drivers/scsi/lpfc/lpfc_scsi.c
+++ b/drivers/scsi/lpfc/lpfc_scsi.c
@@ -5929,13 +5929,15 @@ lpfc_abort_handler(struct scsi_cmnd *cmnd)
}
lpfc_cmd->waitq = &waitq;
- if (phba->sli_rev == LPFC_SLI_REV4)
+ if (phba->sli_rev == LPFC_SLI_REV4) {
spin_unlock(&pring_s4->ring_lock);
- else
+ ret_val = lpfc_sli4_issue_abort_iotag(phba, iocb,
+ lpfc_sli_abort_fcp_cmpl);
+ } else {
pring = &phba->sli.sli3_ring[LPFC_FCP_RING];
-
- ret_val = lpfc_sli_issue_abort_iotag(phba, pring, iocb,
- lpfc_sli_abort_fcp_cmpl);
+ ret_val = lpfc_sli_issue_abort_iotag(phba, pring, iocb,
+ lpfc_sli_abort_fcp_cmpl);
+ }
/* Make sure HBA is alive */
lpfc_issue_hb_tmo(phba);
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 7294a9bcaa7ee0d3b96aab1a277317315fd46f09 Mon Sep 17 00:00:00 2001
From: James Smart <jsmart2021(a)gmail.com>
Date: Wed, 23 Mar 2022 13:55:44 -0700
Subject: [PATCH] scsi: lpfc: Fix broken SLI4 abort path
There was a merge error in the 14.2.0.0 patches that resulted in the SLI4
path using the SLI3 issue_abort_iotag() routine. This resulted in txcmplq
corruption.
Fix this by using the SLI4 routine when running SLI4.
Link: https://lore.kernel.org/r/20220323205545.81814-2-jsmart2021@gmail.com
Fixes: 31a59f75702f ("scsi: lpfc: SLI path split: Refactor Abort paths")
Cc: <stable(a)vger.kernel.org> # v5.2+
Co-developed-by: Dick Kennedy <dick.kennedy(a)broadcom.com>
Signed-off-by: Dick Kennedy <dick.kennedy(a)broadcom.com>
Signed-off-by: James Smart <jsmart2021(a)gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen(a)oracle.com>
diff --git a/drivers/scsi/lpfc/lpfc_scsi.c b/drivers/scsi/lpfc/lpfc_scsi.c
index 3c132604fd91..ba9dbb51b75f 100644
--- a/drivers/scsi/lpfc/lpfc_scsi.c
+++ b/drivers/scsi/lpfc/lpfc_scsi.c
@@ -5929,13 +5929,15 @@ lpfc_abort_handler(struct scsi_cmnd *cmnd)
}
lpfc_cmd->waitq = &waitq;
- if (phba->sli_rev == LPFC_SLI_REV4)
+ if (phba->sli_rev == LPFC_SLI_REV4) {
spin_unlock(&pring_s4->ring_lock);
- else
+ ret_val = lpfc_sli4_issue_abort_iotag(phba, iocb,
+ lpfc_sli_abort_fcp_cmpl);
+ } else {
pring = &phba->sli.sli3_ring[LPFC_FCP_RING];
-
- ret_val = lpfc_sli_issue_abort_iotag(phba, pring, iocb,
- lpfc_sli_abort_fcp_cmpl);
+ ret_val = lpfc_sli_issue_abort_iotag(phba, pring, iocb,
+ lpfc_sli_abort_fcp_cmpl);
+ }
/* Make sure HBA is alive */
lpfc_issue_hb_tmo(phba);