Certain AMD processors are vulnerable to a cross-thread return address
predictions bug. When running in SMT mode and one of the sibling threads
transitions out of C0 state, the other thread gets access to twice as many
entries in the RSB, but unfortunately the predictions of the now-halted
logical processor are not purged. Therefore, the executing processor
could speculatively execute from locations that the now-halted processor
had trained the RSB on.
The Spectre v2 mitigations cover the Linux kernel, as it fills the RSB
when context switching to the idle thread. However, KVM allows a VMM to
prevent exiting guest mode when transitioning out of C0; the
KVM_CAP_X86_DISABLE_EXITS capability can be used by a VMM to change this
behavior. To mitigate the cross-thread return address predictions bug,
a VMM must not be allowed to override the default behavior of
intercepting C0 transitions.
These patches introduce a KVM module parameter that, if set, will prevent
the user from disabling the HLT, MWAIT and CSTATE exits.
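To illustrate the idea (this is only a sketch, not the exact code in the
series; the helper name below is made up, while the KVM_X86_DISABLE_EXITS_*
flags and kvm_can_mwait_in_guest() are existing KVM symbols), the module
parameter works by refusing to advertise or accept the HLT/MWAIT/CSTATE
exit-disable bits:

/*
 * Sketch only: gate KVM_CAP_X86_DISABLE_EXITS behind a module parameter
 * so that a VMM cannot disable interception of C0 transitions when the
 * cross-thread RSB mitigation is requested.
 */
static bool mitigate_smt_rsb;
module_param(mitigate_smt_rsb, bool, 0444);

static int disable_exits_allowed(void)
{
	int r = KVM_X86_DISABLE_EXITS_PAUSE;

	/* HLT/MWAIT/CSTATE exits may only be disabled when not mitigating. */
	if (!mitigate_smt_rsb) {
		r |= KVM_X86_DISABLE_EXITS_HLT | KVM_X86_DISABLE_EXITS_CSTATE;
		if (kvm_can_mwait_in_guest())
			r |= KVM_X86_DISABLE_EXITS_MWAIT;
	}

	return r;
}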
The patches apply to the 5.15 stable tree, and Greg has already received
them through a git bundle. The difference is only in context, but it is
too much for "git cherry-pick" so here they are.
Thanks,
Paolo
Tom Lendacky (3):
x86/speculation: Identify processors vulnerable to SMT RSB predictions
KVM: x86: Mitigate the cross-thread return address predictions bug
Documentation/hw-vuln: Add documentation for Cross-Thread Return
Predictions
.../admin-guide/hw-vuln/cross-thread-rsb.rst | 92 +++++++++++++++++++
Documentation/admin-guide/hw-vuln/index.rst | 1 +
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/kernel/cpu/common.c | 9 +-
arch/x86/kvm/x86.c | 43 ++++++---
5 files changed, 133 insertions(+), 13 deletions(-)
create mode 100644 Documentation/admin-guide/hw-vuln/cross-thread-rsb.rst
--
2.39.1
When a grant entry is still in use by the remote domain, Linux must put
it on a deferred list. Normally, this list is very short, because
the PV network and block protocols expect the backend to unmap the grant
first. However, Qubes OS's GUI protocol is subject to the constraints
of the X Window System, and as such winds up with the frontend unmapping
the window first. As a result, the list can grow very large, resulting
in a massive memory leak and eventual VM freeze.
Fix this problem by bumping the number of entries that the VM will
attempt to free at each iteration to 10000. This is an ugly hack that
may well make a denial of service easier, but for Qubes OS that is less
bad than the problem Qubes OS users are facing today. There really
needs to be a way for a frontend to be notified when the backend has
unmapped the grants. Additionally, a module parameter is provided to
allow tuning the reclaim speed.
The code previously used printk(KERN_DEBUG) whenever it had to defer
reclaiming a page because the grant was still mapped. This resulted in
a large volume of log messages that bothered users. Use pr_debug
instead, which suppresses the messages by default. Developers can
enable them using the dynamic debug mechanism.
Fixes: QubesOS/qubes-issues#7410 (memory leak)
Fixes: QubesOS/qubes-issues#7359 (excessive logging)
Fixes: 569ca5b3f94c ("xen/gnttab: add deferred freeing logic")
Cc: stable(a)vger.kernel.org
Signed-off-by: Demi Marie Obenour <demi(a)invisiblethingslab.com>
---
Anyone have suggestions for improving the grant mechanism? Argo isn't
a good option, as in the GUI protocol there are substantial performance
wins to be had by using true shared memory. Resending as I forgot the
Signed-off-by on the first submission. Sorry about that.
drivers/xen/grant-table.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/xen/grant-table.c b/drivers/xen/grant-table.c
index 5c83d41..2c2faa7 100644
--- a/drivers/xen/grant-table.c
+++ b/drivers/xen/grant-table.c
@@ -355,14 +355,20 @@
static void gnttab_handle_deferred(struct timer_list *);
static DEFINE_TIMER(deferred_timer, gnttab_handle_deferred);
+static atomic64_t deferred_count;
+static atomic64_t leaked_count;
+static unsigned int free_per_iteration = 10000;
+
static void gnttab_handle_deferred(struct timer_list *unused)
{
- unsigned int nr = 10;
+ unsigned int nr = READ_ONCE(free_per_iteration);
+ const bool ignore_limit = nr == 0;
struct deferred_entry *first = NULL;
unsigned long flags;
+ size_t freed = 0;
spin_lock_irqsave(&gnttab_list_lock, flags);
- while (nr--) {
+ while ((ignore_limit || nr--) && !list_empty(&deferred_list)) {
struct deferred_entry *entry
= list_first_entry(&deferred_list,
struct deferred_entry, list);
@@ -372,10 +378,13 @@
list_del(&entry->list);
spin_unlock_irqrestore(&gnttab_list_lock, flags);
if (_gnttab_end_foreign_access_ref(entry->ref)) {
+ uint64_t ret = atomic64_sub_return(1, &deferred_count);
put_free_entry(entry->ref);
- pr_debug("freeing g.e. %#x (pfn %#lx)\n",
- entry->ref, page_to_pfn(entry->page));
+ pr_debug("freeing g.e. %#x (pfn %#lx), %llu remaining\n",
+ entry->ref, page_to_pfn(entry->page),
+ (unsigned long long)ret);
put_page(entry->page);
+ freed++;
kfree(entry);
entry = NULL;
} else {
@@ -387,14 +396,15 @@
spin_lock_irqsave(&gnttab_list_lock, flags);
if (entry)
list_add_tail(&entry->list, &deferred_list);
- else if (list_empty(&deferred_list))
- break;
}
- if (!list_empty(&deferred_list) && !timer_pending(&deferred_timer)) {
+ if (list_empty(&deferred_list))
+ WARN_ON(atomic64_read(&deferred_count));
+ else if (!timer_pending(&deferred_timer)) {
deferred_timer.expires = jiffies + HZ;
add_timer(&deferred_timer);
}
spin_unlock_irqrestore(&gnttab_list_lock, flags);
+ pr_debug("Freed %zu references", freed);
}
static void gnttab_add_deferred(grant_ref_t ref, struct page *page)
@@ -402,7 +412,7 @@
{
struct deferred_entry *entry;
gfp_t gfp = (in_atomic() || irqs_disabled()) ? GFP_ATOMIC : GFP_KERNEL;
- const char *what = KERN_WARNING "leaking";
+ uint64_t leaked, deferred;
entry = kmalloc(sizeof(*entry), gfp);
if (!page) {
@@ -426,12 +436,20 @@
add_timer(&deferred_timer);
}
spin_unlock_irqrestore(&gnttab_list_lock, flags);
- what = KERN_DEBUG "deferring";
+ deferred = atomic64_add_return(1, &deferred_count);
+ leaked = atomic64_read(&leaked_count);
+ pr_debug("deferring g.e. %#x (pfn %#lx) (total deferred %llu, total leaked %llu)\n",
+ ref, page ? page_to_pfn(page) : -1, deferred, leaked);
+ } else {
+ deferred = atomic64_read(&deferred_count);
+ leaked = atomic64_add_return(1, &leaked_count);
+ pr_warn("leaking g.e. %#x (pfn %#lx) (total deferred %llu, total leaked %llu)\n",
+ ref, page ? page_to_pfn(page) : -1, deferred, leaked);
}
- printk("%s g.e. %#x (pfn %#lx)\n",
- what, ref, page ? page_to_pfn(page) : -1);
}
+module_param(free_per_iteration, uint, 0600);
+
int gnttab_try_end_foreign_access(grant_ref_t ref)
{
int ret = _gnttab_end_foreign_access_ref(ref);
--
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab
Assert the PCI Configuration Enable bit after probe. When this bit is left
at 0 in endpoint mode, the RK3399 PCIe endpoint core will generate
configuration request retry status (CRS) messages back to the root complex.
Asserting this bit after probe allows the RK3399 PCIe endpoint core to reply
to configuration requests from the root complex.
This is documented in section 17.5.8.1.2 of the RK3399 TRM.
Fixes: cf590b078391 ("PCI: rockchip: Add EP driver for Rockchip PCIe controller")
Cc: stable(a)vger.kernel.org
Signed-off-by: Rick Wertenbroek <rick.wertenbroek(a)gmail.com>
---
drivers/pci/controller/pcie-rockchip-ep.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/pci/controller/pcie-rockchip-ep.c b/drivers/pci/controller/pcie-rockchip-ep.c
index 9b835377b..4c84e403e 100644
--- a/drivers/pci/controller/pcie-rockchip-ep.c
+++ b/drivers/pci/controller/pcie-rockchip-ep.c
@@ -623,6 +623,8 @@ static int rockchip_pcie_ep_probe(struct platform_device *pdev)
ep->irq_pci_addr = ROCKCHIP_PCIE_EP_DUMMY_IRQ_ADDR;
+ rockchip_pcie_write(rockchip, PCIE_CLIENT_CONF_ENABLE, PCIE_CLIENT_CONFIG);
+
return 0;
err_epc_mem_exit:
pci_epc_mem_exit(epc);
--
2.25.1
This is the start of the stable review cycle for the 5.15.94 release.
There are 67 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Wed, 15 Feb 2023 14:46:51 +0000.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.15.94-rc…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.15.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 5.15.94-rc1
Russell King (Oracle) <rmk+kernel(a)armlinux.org.uk>
nvmem: core: fix return value
Ville Syrjälä <ville.syrjala(a)linux.intel.com>
drm/i915: Fix VBT DSI DVO port handling
Aravind Iddamsetty <aravind.iddamsetty(a)intel.com>
drm/i915: Initialize the obj flags for shmem objects
Guilherme G. Piccoli <gpiccoli(a)igalia.com>
drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini
David Chen <david.chen(a)nutanix.com>
Fix page corruption caused by racy check in __free_pages
Heiner Kallweit <hkallweit1(a)gmail.com>
arm64: dts: meson-axg: Make mmc host controller interrupts level-sensitive
Heiner Kallweit <hkallweit1(a)gmail.com>
arm64: dts: meson-g12-common: Make mmc host controller interrupts level-sensitive
Heiner Kallweit <hkallweit1(a)gmail.com>
arm64: dts: meson-gx: Make mmc host controller interrupts level-sensitive
Wander Lairson Costa <wander(a)redhat.com>
rtmutex: Ensure that the top waiter is always woken up
Nicholas Piggin <npiggin(a)gmail.com>
powerpc/64s/interrupt: Fix interrupt exit race with security mitigation switch
Guo Ren <guoren(a)linux.alibaba.com>
riscv: Fixup race condition on PG_dcache_clean in flush_icache_pte
Xiubo Li <xiubli(a)redhat.com>
ceph: flush cap releases when the session is flushed
Paul Cercueil <paul(a)crapouillou.net>
clk: ingenic: jz4760: Update M/N/OD calculation algorithm
Prashant Malani <pmalani(a)chromium.org>
usb: typec: altmodes/displayport: Fix probe pin assign check
Mark Pearson <mpearson-lenovo(a)squebb.ca>
usb: core: add quirk for Alcor Link AK9563 smartcard reader
Anand Jain <anand.jain(a)oracle.com>
btrfs: free device in btrfs_close_devices for a single device filesystem
Paolo Abeni <pabeni(a)redhat.com>
mptcp: be careful on subflow status propagation on errors
Alan Stern <stern(a)rowland.harvard.edu>
net: USB: Fix wrong-direction WARNING in plusb.c
ZhaoLong Wang <wangzhaolong1(a)huawei.com>
cifs: Fix use-after-free in rdata->read_into_pages()
Andy Shevchenko <andriy.shevchenko(a)linux.intel.com>
pinctrl: intel: Restore the pins that used to be in Direct IRQ mode
Serge Semin <Sergey.Semin(a)baikalelectronics.ru>
spi: dw: Fix wrong FIFO level setting for long xfers
Maxim Korotkov <korotkov.maxim.s(a)gmail.com>
pinctrl: single: fix potential NULL dereference
Joel Stanley <joel(a)jms.id.au>
pinctrl: aspeed: Fix confusing types in return value
Guodong Liu <Guodong.Liu(a)mediatek.com>
pinctrl: mediatek: Fix the drive register definition of some Pins
Amadeusz Sławiński <amadeuszx.slawinski(a)linux.intel.com>
ASoC: topology: Return -ENOMEM on memory allocation failure
Liu Shixin <liushixin2(a)huawei.com>
riscv: stacktrace: Fix missing the first frame
Dan Carpenter <error27(a)gmail.com>
ALSA: pci: lx6464es: fix a debug loop
Hangbin Liu <liuhangbin(a)gmail.com>
selftests: forwarding: lib: quote the sysctl values
Pietro Borrello <borrello(a)diag.uniroma1.it>
rds: rds_rm_zerocopy_callback() use list_first_entry()
Sasha Neftin <sasha.neftin(a)intel.com>
igc: Add ndo_tx_timeout support
Shay Drory <shayd(a)nvidia.com>
net/mlx5: Serialize module cleanup with reload and remove
Shay Drory <shayd(a)nvidia.com>
net/mlx5: fw_tracer, Zero consumer index when reloading the tracer
Shay Drory <shayd(a)nvidia.com>
net/mlx5: fw_tracer, Clear load bit when freeing string DBs buffers
Dragos Tatulea <dtatulea(a)nvidia.com>
net/mlx5e: IPoIB, Show unknown speed instead of error
Vlad Buslov <vladbu(a)nvidia.com>
net/mlx5: Bridge, fix ageing of peer FDB entries
Adham Faris <afaris(a)nvidia.com>
net/mlx5e: Update rx ring hw mtu upon each rx-fcs flag change
Maxim Mikityanskiy <maximmi(a)nvidia.com>
net/mlx5e: Introduce the mlx5e_flush_rq function
Maxim Mikityanskiy <maximmi(a)nvidia.com>
net/mlx5e: Move repeating clear_bit in mlx5e_rx_reporter_err_rq_cqe_recover
Vladimir Oltean <vladimir.oltean(a)nxp.com>
net: mscc: ocelot: fix VCAP filters not matching on MAC with "protocol 802.1Q"
Vladimir Oltean <vladimir.oltean(a)nxp.com>
net: dsa: mt7530: don't change PVC_EG_TAG when CPU port becomes VLAN-aware
Anirudh Venkataramanan <anirudh.venkataramanan(a)intel.com>
ice: Do not use WQ_MEM_RECLAIM flag for workqueue
Herton R. Krzesinski <herton(a)redhat.com>
uapi: add missing ip/ipv6 header dependencies for linux/stddef.h
Neel Patel <neel.patel(a)amd.com>
ionic: clean interrupt before enabling queue to avoid credit race
Heiner Kallweit <hkallweit1(a)gmail.com>
net: phy: meson-gxl: use MMD access dummy stubs for GXL, internal PHY
Qi Zheng <zhengqi.arch(a)bytedance.com>
bonding: fix error checking in bond_debug_reregister()
Clément Léger <clement.leger(a)bootlin.com>
net: phylink: move phy_device_free() to correctly release phy device
Christian Hopps <chopps(a)chopps.org>
xfrm: fix bug with DSCP copy to v6 from v4 tunnel
Yang Yingliang <yangyingliang(a)huawei.com>
RDMA/usnic: use iommu_map_atomic() under spin_lock()
Nikita Zhandarovich <n.zhandarovich(a)fintech.ru>
RDMA/irdma: Fix potential NULL-ptr-dereference
Dragos Tatulea <dtatulea(a)nvidia.com>
IB/IPoIB: Fix legacy IPoIB due to wrong number of queues
Eric Dumazet <edumazet(a)google.com>
xfrm/compat: prevent potential spectre v1 gadget in xfrm_xlate32_attr()
Dean Luick <dean.luick(a)cornelisnetworks.com>
IB/hfi1: Restore allocated resources on failed copyout
Anastasia Belova <abelova(a)astralinux.ru>
xfrm: compat: change expression for switch in xfrm_xlate64
Devid Antonio Filoni <devid.filoni(a)egluetechnologies.com>
can: j1939: do not wait 250 ms if the same addr was already claimed
Mark Brown <broonie(a)kernel.org>
of/address: Return an error when no valid dma-ranges are found
Shiju Jose <shiju.jose(a)huawei.com>
tracing: Fix poll() and select() do not work on per_cpu trace_pipe and trace_pipe_raw
Elvis Angelaccio <elvis.angelaccio(a)kde.org>
ALSA: hda/realtek: Enable mute/micmute LEDs on HP Elitebook, 645 G9
Guillaume Pinot <texitoi(a)texitoi.eu>
ALSA: hda/realtek: Fix the speaker output on Samsung Galaxy Book2 Pro 360
Artemii Karasev <karasev(a)ispras.ru>
ALSA: emux: Avoid potential array out-of-bound in snd_emux_xg_control()
Edson Juliano Drosdeck <edson.drosdeck(a)gmail.com>
ALSA: hda/realtek: Add Positivo N14KP6-TG
Alexander Potapenko <glider(a)google.com>
btrfs: zlib: zero-initialize zlib workspace
Josef Bacik <josef(a)toxicpanda.com>
btrfs: limit device extents to the device size
Mike Kravetz <mike.kravetz(a)oracle.com>
migrate: hugetlb: check for hugetlb shared PMD in node migration
Miaohe Lin <linmiaohe(a)huawei.com>
mm/migration: return errno when isolate_huge_page failed
Russell King (Oracle) <rmk+kernel(a)armlinux.org.uk>
nvmem: core: fix registration vs use race
Russell King (Oracle) <rmk+kernel(a)armlinux.org.uk>
nvmem: core: fix cleanup after dev_set_name()
Gaosheng Cui <cuigaosheng1(a)huawei.com>
nvmem: core: add error handling for dev_set_name
-------------
Diffstat:
Makefile | 4 +-
arch/arm64/boot/dts/amlogic/meson-axg.dtsi | 4 +-
arch/arm64/boot/dts/amlogic/meson-g12-common.dtsi | 6 +-
arch/arm64/boot/dts/amlogic/meson-gx.dtsi | 6 +-
arch/powerpc/kernel/interrupt.c | 6 +-
arch/riscv/kernel/stacktrace.c | 3 +-
arch/riscv/mm/cacheflush.c | 4 +-
drivers/clk/ingenic/jz4760-cgu.c | 18 ++--
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 8 +-
drivers/gpu/drm/i915/display/intel_bios.c | 33 +++++---
drivers/gpu/drm/i915/gem/i915_gem_shmem.c | 2 +-
drivers/infiniband/hw/hfi1/file_ops.c | 7 +-
drivers/infiniband/hw/irdma/cm.c | 3 +
drivers/infiniband/hw/usnic/usnic_uiom.c | 8 +-
drivers/infiniband/ulp/ipoib/ipoib_main.c | 8 ++
drivers/net/bonding/bond_debugfs.c | 2 +-
drivers/net/dsa/mt7530.c | 26 ++++--
drivers/net/ethernet/intel/ice/ice_main.c | 2 +-
drivers/net/ethernet/intel/igc/igc_main.c | 25 +++++-
.../ethernet/mellanox/mlx5/core/diag/fw_tracer.c | 3 +-
drivers/net/ethernet/mellanox/mlx5/core/en.h | 2 +-
.../ethernet/mellanox/mlx5/core/en/rep/bridge.c | 4 -
.../ethernet/mellanox/mlx5/core/en/reporter_rx.c | 30 +------
drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 98 ++++++++--------------
.../net/ethernet/mellanox/mlx5/core/esw/bridge.c | 2 +-
.../ethernet/mellanox/mlx5/core/ipoib/ethtool.c | 13 ++-
drivers/net/ethernet/mellanox/mlx5/core/main.c | 14 ++--
drivers/net/ethernet/mscc/ocelot_flower.c | 24 +++---
drivers/net/ethernet/pensando/ionic/ionic_lif.c | 15 +++-
drivers/net/phy/meson-gxl.c | 2 +
drivers/net/phy/phylink.c | 5 +-
drivers/net/usb/plusb.c | 4 +-
drivers/nvmem/core.c | 41 ++++-----
drivers/of/address.c | 21 +++--
drivers/pinctrl/aspeed/pinctrl-aspeed.c | 2 +-
drivers/pinctrl/intel/pinctrl-intel.c | 16 +++-
drivers/pinctrl/mediatek/pinctrl-mt8195.c | 4 +-
drivers/pinctrl/pinctrl-single.c | 2 +
drivers/spi/spi-dw-core.c | 2 +-
drivers/usb/core/quirks.c | 3 +
drivers/usb/typec/altmodes/displayport.c | 8 +-
fs/btrfs/volumes.c | 22 ++++-
fs/btrfs/zlib.c | 2 +-
fs/ceph/mds_client.c | 6 ++
fs/cifs/file.c | 4 +-
include/linux/hugetlb.h | 6 +-
include/uapi/linux/ip.h | 1 +
include/uapi/linux/ipv6.h | 1 +
kernel/locking/rtmutex.c | 5 +-
kernel/trace/trace.c | 3 -
mm/gup.c | 2 +-
mm/hugetlb.c | 11 ++-
mm/memory-failure.c | 2 +-
mm/memory_hotplug.c | 2 +-
mm/mempolicy.c | 5 +-
mm/migrate.c | 7 +-
mm/page_alloc.c | 5 +-
net/can/j1939/address-claim.c | 40 +++++++++
net/mptcp/subflow.c | 10 ++-
net/rds/message.c | 6 +-
net/xfrm/xfrm_compat.c | 4 +-
net/xfrm/xfrm_input.c | 3 +-
sound/pci/hda/patch_realtek.c | 3 +
sound/pci/lx6464es/lx_core.c | 11 ++-
sound/soc/soc-topology.c | 8 +-
sound/synth/emux/emux_nrpn.c | 3 +
tools/testing/selftests/net/forwarding/lib.sh | 4 +-
67 files changed, 398 insertions(+), 268 deletions(-)
changes since v2:
- The code was totally rewritten based on the discussion of the
v2 patch.
- The ssid is set in __cfg80211_connect_result() and only if the ssid is
not already set.
- Do not add another ssid reset path since it is already done in
__cfg80211_disconnected()
When a connection was established without going through
NL80211_CMD_CONNECT, the ssid was never set in the wireless_dev struct.
Now we set it in __cfg80211_connect_result() when it is not already set.
Reported-by: Yohan Prod'homme <kernel(a)zoddo.fr>
Fixes: 7b0a0e3c3a88260b6fcb017e49f198463aa62ed1
Cc: linux-wireless(a)vger.kernel.org
Cc: stable(a)vger.kernel.org
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216711
Signed-off-by: Marc Bornand <dev.mbornand(a)systemb.ch>
---
net/wireless/sme.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/net/wireless/sme.c b/net/wireless/sme.c
index 4b5b6ee0fe01..629d7b5f65c1 100644
--- a/net/wireless/sme.c
+++ b/net/wireless/sme.c
@@ -723,6 +723,7 @@ void __cfg80211_connect_result(struct net_device *dev,
bool wextev)
{
struct wireless_dev *wdev = dev->ieee80211_ptr;
+ const struct element *ssid;
const struct element *country_elem = NULL;
const u8 *country_data;
u8 country_datalen;
@@ -883,6 +884,21 @@ void __cfg80211_connect_result(struct net_device *dev,
country_data, country_datalen);
kfree(country_data);
+ if (wdev->u.client.ssid_len == 0) {
+ rcu_read_lock();
+ for_each_valid_link(cr, link) {
+ ssid = ieee80211_bss_get_elem(cr->links[link].bss,
+ WLAN_EID_SSID);
+
+ if (ssid->datalen == 0)
+ continue;
+
+ memcpy(wdev->u.client.ssid, ssid->data, ssid->datalen);
+ wdev->u.client.ssid_len = ssid->datalen;
+ }
+ rcu_read_unlock();
+ }
+
return;
out:
for_each_valid_link(cr, link)
--
2.39.1
Add the MST topology for a CRTC to the atomic state if the driver
needs to force a modeset on the CRTC after the encoder compute config
functions are called.
Later the MST encoder's disable hook also adds the state, but that isn't
guaranteed to work (since in that hook getting the state may fail, which
can't be handled there). This should fix that, while a later patch fixes
the use of the MST state in the disable hook.
v2: Add missing forward struct declarations, caught by hdrtest.
v3: Factor out intel_dp_mst_add_topology_state_for_connector() used
later in the patchset.
Cc: Lyude Paul <lyude(a)redhat.com>
Cc: Ville Syrjälä <ville.syrjala(a)linux.intel.com>
Cc: stable(a)vger.kernel.org # 6.1
Reviewed-by: Ville Syrjälä <ville.syrjala(a)linux.intel.com> # v2
Reviewed-by: Lyude Paul <lyude(a)redhat.com>
Signed-off-by: Imre Deak <imre.deak(a)intel.com>
---
drivers/gpu/drm/i915/display/intel_display.c | 4 ++
drivers/gpu/drm/i915/display/intel_dp_mst.c | 61 ++++++++++++++++++++
drivers/gpu/drm/i915/display/intel_dp_mst.h | 4 ++
3 files changed, 69 insertions(+)
diff --git a/drivers/gpu/drm/i915/display/intel_display.c b/drivers/gpu/drm/i915/display/intel_display.c
index 166662ade593c..38106cf63b3b9 100644
--- a/drivers/gpu/drm/i915/display/intel_display.c
+++ b/drivers/gpu/drm/i915/display/intel_display.c
@@ -5936,6 +5936,10 @@ int intel_modeset_all_pipes(struct intel_atomic_state *state,
if (ret)
return ret;
+ ret = intel_dp_mst_add_topology_state_for_crtc(state, crtc);
+ if (ret)
+ return ret;
+
ret = intel_atomic_add_affected_planes(state, crtc);
if (ret)
return ret;
diff --git a/drivers/gpu/drm/i915/display/intel_dp_mst.c b/drivers/gpu/drm/i915/display/intel_dp_mst.c
index 8b0e4defa3f10..f3cb12dcfe0a7 100644
--- a/drivers/gpu/drm/i915/display/intel_dp_mst.c
+++ b/drivers/gpu/drm/i915/display/intel_dp_mst.c
@@ -1223,3 +1223,64 @@ bool intel_dp_mst_is_slave_trans(const struct intel_crtc_state *crtc_state)
return crtc_state->mst_master_transcoder != INVALID_TRANSCODER &&
crtc_state->mst_master_transcoder != crtc_state->cpu_transcoder;
}
+
+/**
+ * intel_dp_mst_add_topology_state_for_connector - add MST topology state for a connector
+ * @state: atomic state
+ * @connector: connector to add the state for
+ * @crtc: the CRTC @connector is attached to
+ *
+ * Add the MST topology state for @connector to @state.
+ *
+ * Returns 0 on success, negative error code on failure.
+ */
+static int
+intel_dp_mst_add_topology_state_for_connector(struct intel_atomic_state *state,
+ struct intel_connector *connector,
+ struct intel_crtc *crtc)
+{
+ struct drm_dp_mst_topology_state *mst_state;
+
+ if (!connector->mst_port)
+ return 0;
+
+ mst_state = drm_atomic_get_mst_topology_state(&state->base,
+ &connector->mst_port->mst_mgr);
+ if (IS_ERR(mst_state))
+ return PTR_ERR(mst_state);
+
+ mst_state->pending_crtc_mask |= drm_crtc_mask(&crtc->base);
+
+ return 0;
+}
+
+/**
+ * intel_dp_mst_add_topology_state_for_crtc - add MST topology state for a CRTC
+ * @state: atomic state
+ * @crtc: CRTC to add the state for
+ *
+ * Add the MST topology state for @crtc to @state.
+ *
+ * Returns 0 on success, negative error code on failure.
+ */
+int intel_dp_mst_add_topology_state_for_crtc(struct intel_atomic_state *state,
+ struct intel_crtc *crtc)
+{
+ struct drm_connector *_connector;
+ struct drm_connector_state *conn_state;
+ int i;
+
+ for_each_new_connector_in_state(&state->base, _connector, conn_state, i) {
+ struct intel_connector *connector = to_intel_connector(_connector);
+ int ret;
+
+ if (conn_state->crtc != &crtc->base)
+ continue;
+
+ ret = intel_dp_mst_add_topology_state_for_connector(state, connector, crtc);
+ if (ret)
+ return ret;
+ }
+
+ return 0;
+}
diff --git a/drivers/gpu/drm/i915/display/intel_dp_mst.h b/drivers/gpu/drm/i915/display/intel_dp_mst.h
index f7301de6cdfb3..f1815bb722672 100644
--- a/drivers/gpu/drm/i915/display/intel_dp_mst.h
+++ b/drivers/gpu/drm/i915/display/intel_dp_mst.h
@@ -8,6 +8,8 @@
#include <linux/types.h>
+struct intel_atomic_state;
+struct intel_crtc;
struct intel_crtc_state;
struct intel_digital_port;
struct intel_dp;
@@ -18,5 +20,7 @@ int intel_dp_mst_encoder_active_links(struct intel_digital_port *dig_port);
bool intel_dp_mst_is_master_trans(const struct intel_crtc_state *crtc_state);
bool intel_dp_mst_is_slave_trans(const struct intel_crtc_state *crtc_state);
bool intel_dp_mst_source_support(struct intel_dp *intel_dp);
+int intel_dp_mst_add_topology_state_for_crtc(struct intel_atomic_state *state,
+ struct intel_crtc *crtc);
#endif /* __INTEL_DP_MST_H__ */
--
2.37.1
The following commit has been merged into the timers/urgent branch of tip:
Commit-ID: d125d1349abeb46945dc5e98f7824bf688266f13
Gitweb: https://git.kernel.org/tip/d125d1349abeb46945dc5e98f7824bf688266f13
Author: Thomas Gleixner <tglx(a)linutronix.de>
AuthorDate: Thu, 09 Feb 2023 23:25:49 +01:00
Committer: Thomas Gleixner <tglx(a)linutronix.de>
CommitterDate: Tue, 14 Feb 2023 11:18:35 +01:00
alarmtimer: Prevent starvation by small intervals and SIG_IGN
syzbot reported an RCU stall which is caused by setting up an alarmtimer
with a very small interval and ignoring the signal. The reproducer arms the
alarm timer with a relative expiry of 8ns and an interval of 9ns. Not a
problem per se, but it becomes an issue when the signal is ignored, because
then the timer is immediately rearmed as there is no way to delay that
rearming to the signal delivery path. See posix_timer_fn() and commit
58229a189942 ("posix-timers: Prevent softirq starvation by small intervals
and SIG_IGN") for details.
The reproducer does not set SIG_IGN explicitly, but it sets up the timer's
signal with SIGCONT. That has the same effect as explicitly setting
SIG_IGN for a signal, as SIGCONT is ignored if there is no handler set and
the task is not ptraced.
The log clearly shows that:
[pid 5102] --- SIGCONT {si_signo=SIGCONT, si_code=SI_TIMER, si_timerid=0, si_overrun=316014, si_int=0, si_ptr=NULL} ---
It works because the tasks are traced and therefore the signal is queued so
the tracer can see it, which delays the restart of the timer to the signal
delivery path. But then the tracer is killed:
[pid 5087] kill(-5102, SIGKILL <unfinished ...>
...
./strace-static-x86_64: Process 5107 detached
and after it's gone the stall can be observed:
syzkaller login: [ 79.439102][ C0] hrtimer: interrupt took 68471 ns
[ 184.460538][ C1] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
...
[ 184.658237][ C1] rcu: Stack dump where RCU GP kthread last ran:
[ 184.664574][ C1] Sending NMI from CPU 1 to CPUs 0:
[ 184.669821][ C0] NMI backtrace for cpu 0
[ 184.669831][ C0] CPU: 0 PID: 5108 Comm: syz-executor192 Not tainted 6.2.0-rc6-next-20230203-syzkaller #0
...
[ 184.670036][ C0] Call Trace:
[ 184.670041][ C0] <IRQ>
[ 184.670045][ C0] alarmtimer_fired+0x327/0x670
posix_timer_fn() prevents that by checking whether the interval for
timers which have the signal ignored is smaller than a jiffie and, if so,
artificially delaying the rearm by shifting the next expiry out by a
jiffie. That's accurate vs. the overrun accounting, but slightly
inaccurate vs. timer_gettime(2).
The comment in that function says what needs to be done and there was a fix
available for the regular userspace induced SIG_IGN mechanism, but that did
not work due to the implicit ignore for SIGCONT and similar signals. This
needs to be worked on, but for now the only available workaround is to do
exactly what posix_timer_fn() does:
Increase the interval of self-rearming timers, which have their signal
ignored, to at least a jiffie.
Interestingly this has been fixed before via commit ff86bf0c65f1
("alarmtimer: Rate limit periodic intervals") already, but that fix got
lost in a later rework.
Reported-by: syzbot+b9564ba6e8e00694511b(a)syzkaller.appspotmail.com
Fixes: f2c45807d399 ("alarmtimer: Switch over to generic set/get/rearm routine")
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Acked-by: John Stultz <jstultz(a)google.com>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/87k00q1no2.ffs@tglx
---
kernel/time/alarmtimer.c | 33 +++++++++++++++++++++++++++++----
1 file changed, 29 insertions(+), 4 deletions(-)
diff --git a/kernel/time/alarmtimer.c b/kernel/time/alarmtimer.c
index 5897828..7e5dff6 100644
--- a/kernel/time/alarmtimer.c
+++ b/kernel/time/alarmtimer.c
@@ -470,11 +470,35 @@ u64 alarm_forward(struct alarm *alarm, ktime_t now, ktime_t interval)
}
EXPORT_SYMBOL_GPL(alarm_forward);
-u64 alarm_forward_now(struct alarm *alarm, ktime_t interval)
+static u64 __alarm_forward_now(struct alarm *alarm, ktime_t interval, bool throttle)
{
struct alarm_base *base = &alarm_bases[alarm->type];
+ ktime_t now = base->get_ktime();
+
+ if (IS_ENABLED(CONFIG_HIGH_RES_TIMERS) && throttle) {
+ /*
+ * Same issue as with posix_timer_fn(). Timers which are
+ * periodic but the signal is ignored can starve the system
+ * with a very small interval. The real fix which was
+ * promised in the context of posix_timer_fn() never
+ * materialized, but someone should really work on it.
+ *
+ * To prevent DOS fake @now to be 1 jiffie out which keeps
+ * the overrun accounting correct but creates an
+ * inconsistency vs. timer_gettime(2).
+ */
+ ktime_t kj = NSEC_PER_SEC / HZ;
+
+ if (interval < kj)
+ now = ktime_add(now, kj);
+ }
+
+ return alarm_forward(alarm, now, interval);
+}
- return alarm_forward(alarm, base->get_ktime(), interval);
+u64 alarm_forward_now(struct alarm *alarm, ktime_t interval)
+{
+ return __alarm_forward_now(alarm, interval, false);
}
EXPORT_SYMBOL_GPL(alarm_forward_now);
@@ -551,9 +575,10 @@ static enum alarmtimer_restart alarm_handle_timer(struct alarm *alarm,
if (posix_timer_event(ptr, si_private) && ptr->it_interval) {
/*
* Handle ignored signals and rearm the timer. This will go
- * away once we handle ignored signals proper.
+ * away once we handle ignored signals proper. Ensure that
+ * small intervals cannot starve the system.
*/
- ptr->it_overrun += alarm_forward_now(alarm, ptr->it_interval);
+ ptr->it_overrun += __alarm_forward_now(alarm, ptr->it_interval, true);
++ptr->it_requeue_pending;
ptr->it_active = 1;
result = ALARMTIMER_RESTART;
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
73bdf65ea748 ("migrate: hugetlb: check for hugetlb shared PMD in node migration")
7ce82f4c3f3e ("mm/migration: return errno when isolate_huge_page failed")
1b7f7e58decc ("mm/gup: Convert check_and_migrate_movable_pages() to use a folio")
f9f38f78c5d5 ("mm: refactor check_and_migrate_movable_pages")
5ac95884a784 ("mm/migrate: enable returning precise migrate_pages() success count")
c5b5a3dd2c1f ("mm: thp: refactor NUMA fault handling")
5db4f15c4fd7 ("mm: memory: add orig_pmd to struct vm_fault")
8f34f1eac382 ("mm/userfaultfd: fix uffd-wp special cases for fork()")
25182f05ffed ("mm,hwpoison: fix race with hugetlb page allocation")
f68749ec342b ("mm/gup: longterm pin migration cleanup")
d1e153fea2a8 ("mm/gup: migrate pinned pages out of movable zone")
1a08ae36cf8b ("mm cma: rename PF_MEMALLOC_NOCMA to PF_MEMALLOC_PIN")
6e7f34ebb8d2 ("mm/gup: check for isolation errors")
f0f4463837da ("mm/gup: return an error on migration failure")
83c02c23d074 ("mm/gup: check every subpage of a compound page during isolation")
c991ffef7bce ("mm/gup: don't pin migrated cma pages in movable zone")
7ee820ee7238 ("Revert "mm: migrate: skip shared exec THP for NUMA balancing"")
ae37c7ff79f1 ("mm: make alloc_contig_range handle in-use hugetlb pages")
369fa227c219 ("mm: make alloc_contig_range handle free hugetlb pages")
c2ad7a1ffeaf ("mm,compaction: let isolate_migratepages_{range,block} return error codes")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 73bdf65ea74857d7fb2ec3067a3cec0e261b1462 Mon Sep 17 00:00:00 2001
From: Mike Kravetz <mike.kravetz(a)oracle.com>
Date: Thu, 26 Jan 2023 14:27:21 -0800
Subject: [PATCH] migrate: hugetlb: check for hugetlb shared PMD in node
migration
migrate_pages/mempolicy semantics state that CAP_SYS_NICE is required to
move pages shared with another process to a different node. page_mapcount
> 1 is being used to determine if a hugetlb page is shared. However, a
hugetlb page will have a mapcount of 1 if mapped by multiple processes via
a shared PMD. As a result, hugetlb pages shared by multiple processes and
mapped with a shared PMD can be moved by a process without CAP_SYS_NICE.
To fix, check for a shared PMD if mapcount is 1. If a shared PMD is found
consider the page shared.
Link: https://lkml.kernel.org/r/20230126222721.222195-3-mike.kravetz@oracle.com
Fixes: e2d8cf405525 ("migrate: add hugepage migration code to migrate_pages()")
Signed-off-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Acked-by: Peter Xu <peterx(a)redhat.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Cc: James Houghton <jthoughton(a)google.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Muchun Song <songmuchun(a)bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi(a)linux.dev>
Cc: Vishal Moola (Oracle) <vishal.moola(a)gmail.com>
Cc: Yang Shi <shy828301(a)gmail.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 02c8a712282f..f940395667c8 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -600,7 +600,8 @@ static int queue_pages_hugetlb(pte_t *pte, unsigned long hmask,
/* With MPOL_MF_MOVE, we migrate only unshared hugepage. */
if (flags & (MPOL_MF_MOVE_ALL) ||
- (flags & MPOL_MF_MOVE && page_mapcount(page) == 1)) {
+ (flags & MPOL_MF_MOVE && page_mapcount(page) == 1 &&
+ !hugetlb_pmd_shared(pte))) {
if (isolate_hugetlb(page, qp->pagelist) &&
(flags & MPOL_MF_STRICT))
/*
From: Andy Shevchenko <andriy.shevchenko(a)linux.intel.com>
It's unusual that we have enumeration by class in the middle of the table.
It might potentially be problematic in the future if we add another entry
after it.
So, move the class-matching entry to be the last one in the ID table.
[ Upstream commit 0b85f59d30b91bd2b93ea7ef0816a4b7e7039e8c ]
Without this change, quirks set in driver_data added after the catch-all
are ignored.
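For background, PCI ID tables are matched in order and the first matching
entry wins, so a device-specific entry placed after the catch-all class
entry can never contribute its quirks. A simplified sketch of that
first-match walk (illustrative only, not the kernel's literal
pci_match_id() implementation; pci_match_one_device() is the existing
helper it builds on):

/*
 * Simplified first-match walk over a pci_device_id table: entries after a
 * catch-all class entry are unreachable for devices that the catch-all
 * already matches, so their driver_data quirks never apply.
 */
static const struct pci_device_id *
first_match(const struct pci_device_id *ids, const struct pci_dev *dev)
{
	while (ids->vendor || ids->subvendor || ids->class_mask) {
		if (pci_match_one_device(ids, dev))
			return ids;	/* later, more specific entries are never seen */
		ids++;
	}
	return NULL;
}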
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andy Shevchenko <andriy.shevchenko(a)linux.intel.com>
Reviewed-by: Keith Busch <kbusch(a)kernel.org>
Reviewed-by: Sagi Grimberg <sagi(a)grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni(a)wdc.com>
Signed-off-by: Christoph Hellwig <hch(a)lst.de>
Signed-off-by: Gwendal Grignou <gwendal(a)chromium.org>
---
drivers/nvme/host/pci.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 5d62d1042c0e6..a58711c488509 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -3199,7 +3199,6 @@ static const struct pci_device_id nvme_id_table[] = {
NVME_QUIRK_IGNORE_DEV_SUBNQN, },
{ PCI_DEVICE(0x1c5c, 0x1504), /* SK Hynix PC400 */
.driver_data = NVME_QUIRK_DISABLE_WRITE_ZEROES, },
- { PCI_DEVICE_CLASS(PCI_CLASS_STORAGE_EXPRESS, 0xffffff) },
{ PCI_DEVICE(0x2646, 0x2263), /* KINGSTON A2000 NVMe SSD */
.driver_data = NVME_QUIRK_NO_DEEPEST_PS, },
{ PCI_DEVICE(PCI_VENDOR_ID_APPLE, 0x2001),
@@ -3209,6 +3208,8 @@ static const struct pci_device_id nvme_id_table[] = {
.driver_data = NVME_QUIRK_SINGLE_VECTOR |
NVME_QUIRK_128_BYTES_SQES |
NVME_QUIRK_SHARED_TAGS },
+
+ { PCI_DEVICE_CLASS(PCI_CLASS_STORAGE_EXPRESS, 0xffffff) },
{ 0, }
};
MODULE_DEVICE_TABLE(pci, nvme_id_table);
--
2.39.1.519.gcb327c4b5f-goog
Mark the Tiger Lake UP{3,4} AHCI controller as "low_power". This enables
S0ix to work out of the box. Otherwise it does not work unless the
user manually sets /sys/class/scsi_host/*/link_power_management_policy.
Intel lists a total of 4 SATA controller IDs in [1] for those mobile
PCHs. This commit just adds the "AHCI" variant since that is the only
one I tested.
[1]: https://cdrdv2.intel.com/v1/dl/getContent/631119
Signed-off-by: Simon Gaiser <simon(a)invisiblethingslab.com>
CC: stable(a)vger.kernel.org
---
As noted above this doesn't include the other PCI IDs listed by Intel
for those PCHs (RAID modes). Also the same is probably needed for newer
generations. But for both I don't have hardware to test handy right now,
so only included what I have actually tested.
Added stable to CC, since on systems using S0ix this prevents S0ix
residency and therefore leads to such high power consumption that
suspend is effectively broken.
drivers/ata/ahci.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
index 14a1c0d14916..3bb9bb483fe3 100644
--- a/drivers/ata/ahci.c
+++ b/drivers/ata/ahci.c
@@ -421,6 +421,7 @@ static const struct pci_device_id ahci_pci_tbl[] = {
{ PCI_VDEVICE(INTEL, 0x34d3), board_ahci_low_power }, /* Ice Lake LP AHCI */
{ PCI_VDEVICE(INTEL, 0x02d3), board_ahci_low_power }, /* Comet Lake PCH-U AHCI */
{ PCI_VDEVICE(INTEL, 0x02d7), board_ahci_low_power }, /* Comet Lake PCH RAID */
+ { PCI_VDEVICE(INTEL, 0xa0d3), board_ahci_low_power }, /* Tiger Lake UP{3,4} AHCI */
/* JMicron 360/1/3/5/6, match class to avoid IDE function */
{ PCI_VENDOR_ID_JMICRON, PCI_ANY_ID, PCI_ANY_ID, PCI_ANY_ID,
--
2.39.1
A hugetlb page will have a mapcount of 1 if mapped by multiple processes
via a shared PMD. This is because only the first process increases the
map count, and subsequent processes just add the shared PMD page to
their page table.
page_mapcount is being used to decide if a hugetlb page is shared or
private in /proc/PID/smaps. Pages referenced via a shared PMD were
incorrectly being counted as private.
To fix, check for a shared PMD if mapcount is 1. If a shared PMD is
found, count the hugetlb page as shared. A new helper to check for a
shared PMD is added.
Fixes: 25ee01a2fca0 ("mm: hugetlb: proc: add hugetlb-related fields to /proc/PID/smaps")
Cc: stable(a)vger.kernel.org
Signed-off-by: Mike Kravetz <mike.kravetz(a)oracle.com>
---
fs/proc/task_mmu.c | 10 ++++++++--
include/linux/hugetlb.h | 12 ++++++++++++
2 files changed, 20 insertions(+), 2 deletions(-)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index e35a0398db63..cb9539879402 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -749,8 +749,14 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
if (mapcount >= 2)
mss->shared_hugetlb += huge_page_size(hstate_vma(vma));
- else
- mss->private_hugetlb += huge_page_size(hstate_vma(vma));
+ else {
+ if (hugetlb_pmd_shared(pte))
+ mss->shared_hugetlb +=
+ huge_page_size(hstate_vma(vma));
+ else
+ mss->private_hugetlb +=
+ huge_page_size(hstate_vma(vma));
+ }
}
return 0;
}
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index e3aa336df900..8e65920e4363 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -1225,6 +1225,18 @@ static inline __init void hugetlb_cma_reserve(int order)
}
#endif
+#ifdef CONFIG_ARCH_WANT_HUGE_PMD_SHARE
+static inline bool hugetlb_pmd_shared(pte_t *pte)
+{
+ return page_count(virt_to_page(pte)) > 1;
+}
+#else
+static inline bool hugetlb_pmd_shared(pte_t *pte)
+{
+ return false;
+}
+#endif
+
bool want_pmd_share(struct vm_area_struct *vma, unsigned long addr);
#ifndef __HAVE_ARCH_FLUSH_HUGETLB_TLB_RANGE
--
2.39.1
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
73bdf65ea748 ("migrate: hugetlb: check for hugetlb shared PMD in node migration")
7ce82f4c3f3e ("mm/migration: return errno when isolate_huge_page failed")
1b7f7e58decc ("mm/gup: Convert check_and_migrate_movable_pages() to use a folio")
f9f38f78c5d5 ("mm: refactor check_and_migrate_movable_pages")
5ac95884a784 ("mm/migrate: enable returning precise migrate_pages() success count")
c5b5a3dd2c1f ("mm: thp: refactor NUMA fault handling")
5db4f15c4fd7 ("mm: memory: add orig_pmd to struct vm_fault")
8f34f1eac382 ("mm/userfaultfd: fix uffd-wp special cases for fork()")
25182f05ffed ("mm,hwpoison: fix race with hugetlb page allocation")
f68749ec342b ("mm/gup: longterm pin migration cleanup")
d1e153fea2a8 ("mm/gup: migrate pinned pages out of movable zone")
1a08ae36cf8b ("mm cma: rename PF_MEMALLOC_NOCMA to PF_MEMALLOC_PIN")
6e7f34ebb8d2 ("mm/gup: check for isolation errors")
f0f4463837da ("mm/gup: return an error on migration failure")
83c02c23d074 ("mm/gup: check every subpage of a compound page during isolation")
c991ffef7bce ("mm/gup: don't pin migrated cma pages in movable zone")
7ee820ee7238 ("Revert "mm: migrate: skip shared exec THP for NUMA balancing"")
ae37c7ff79f1 ("mm: make alloc_contig_range handle in-use hugetlb pages")
369fa227c219 ("mm: make alloc_contig_range handle free hugetlb pages")
c2ad7a1ffeaf ("mm,compaction: let isolate_migratepages_{range,block} return error codes")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 73bdf65ea74857d7fb2ec3067a3cec0e261b1462 Mon Sep 17 00:00:00 2001
From: Mike Kravetz <mike.kravetz(a)oracle.com>
Date: Thu, 26 Jan 2023 14:27:21 -0800
Subject: [PATCH] migrate: hugetlb: check for hugetlb shared PMD in node
migration
migrate_pages/mempolicy semantics state that CAP_SYS_NICE is required to
move pages shared with another process to a different node. page_mapcount
> 1 is being used to determine if a hugetlb page is shared. However, a
hugetlb page will have a mapcount of 1 if mapped by multiple processes via
a shared PMD. As a result, hugetlb pages shared by multiple processes and
mapped with a shared PMD can be moved by a process without CAP_SYS_NICE.
To fix, check for a shared PMD if mapcount is 1. If a shared PMD is found
consider the page shared.
Link: https://lkml.kernel.org/r/20230126222721.222195-3-mike.kravetz@oracle.com
Fixes: e2d8cf405525 ("migrate: add hugepage migration code to migrate_pages()")
Signed-off-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Acked-by: Peter Xu <peterx(a)redhat.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Cc: James Houghton <jthoughton(a)google.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Muchun Song <songmuchun(a)bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi(a)linux.dev>
Cc: Vishal Moola (Oracle) <vishal.moola(a)gmail.com>
Cc: Yang Shi <shy828301(a)gmail.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 02c8a712282f..f940395667c8 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -600,7 +600,8 @@ static int queue_pages_hugetlb(pte_t *pte, unsigned long hmask,
/* With MPOL_MF_MOVE, we migrate only unshared hugepage. */
if (flags & (MPOL_MF_MOVE_ALL) ||
- (flags & MPOL_MF_MOVE && page_mapcount(page) == 1)) {
+ (flags & MPOL_MF_MOVE && page_mapcount(page) == 1 &&
+ !hugetlb_pmd_shared(pte))) {
if (isolate_hugetlb(page, qp->pagelist) &&
(flags & MPOL_MF_STRICT))
/*
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
73bdf65ea748 ("migrate: hugetlb: check for hugetlb shared PMD in node migration")
7ce82f4c3f3e ("mm/migration: return errno when isolate_huge_page failed")
1b7f7e58decc ("mm/gup: Convert check_and_migrate_movable_pages() to use a folio")
f9f38f78c5d5 ("mm: refactor check_and_migrate_movable_pages")
5ac95884a784 ("mm/migrate: enable returning precise migrate_pages() success count")
c5b5a3dd2c1f ("mm: thp: refactor NUMA fault handling")
5db4f15c4fd7 ("mm: memory: add orig_pmd to struct vm_fault")
8f34f1eac382 ("mm/userfaultfd: fix uffd-wp special cases for fork()")
25182f05ffed ("mm,hwpoison: fix race with hugetlb page allocation")
f68749ec342b ("mm/gup: longterm pin migration cleanup")
d1e153fea2a8 ("mm/gup: migrate pinned pages out of movable zone")
1a08ae36cf8b ("mm cma: rename PF_MEMALLOC_NOCMA to PF_MEMALLOC_PIN")
6e7f34ebb8d2 ("mm/gup: check for isolation errors")
f0f4463837da ("mm/gup: return an error on migration failure")
83c02c23d074 ("mm/gup: check every subpage of a compound page during isolation")
c991ffef7bce ("mm/gup: don't pin migrated cma pages in movable zone")
7ee820ee7238 ("Revert "mm: migrate: skip shared exec THP for NUMA balancing"")
ae37c7ff79f1 ("mm: make alloc_contig_range handle in-use hugetlb pages")
369fa227c219 ("mm: make alloc_contig_range handle free hugetlb pages")
c2ad7a1ffeaf ("mm,compaction: let isolate_migratepages_{range,block} return error codes")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 73bdf65ea74857d7fb2ec3067a3cec0e261b1462 Mon Sep 17 00:00:00 2001
From: Mike Kravetz <mike.kravetz(a)oracle.com>
Date: Thu, 26 Jan 2023 14:27:21 -0800
Subject: [PATCH] migrate: hugetlb: check for hugetlb shared PMD in node
migration
migrate_pages/mempolicy semantics state that CAP_SYS_NICE is required to
move pages shared with another process to a different node. page_mapcount
> 1 is being used to determine if a hugetlb page is shared. However, a
hugetlb page will have a mapcount of 1 if mapped by multiple processes via
a shared PMD. As a result, hugetlb pages shared by multiple processes and
mapped with a shared PMD can be moved by a process without CAP_SYS_NICE.
To fix, check for a shared PMD if mapcount is 1. If a shared PMD is found
consider the page shared.
Link: https://lkml.kernel.org/r/20230126222721.222195-3-mike.kravetz@oracle.com
Fixes: e2d8cf405525 ("migrate: add hugepage migration code to migrate_pages()")
Signed-off-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Acked-by: Peter Xu <peterx(a)redhat.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Cc: James Houghton <jthoughton(a)google.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Muchun Song <songmuchun(a)bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi(a)linux.dev>
Cc: Vishal Moola (Oracle) <vishal.moola(a)gmail.com>
Cc: Yang Shi <shy828301(a)gmail.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 02c8a712282f..f940395667c8 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -600,7 +600,8 @@ static int queue_pages_hugetlb(pte_t *pte, unsigned long hmask,
/* With MPOL_MF_MOVE, we migrate only unshared hugepage. */
if (flags & (MPOL_MF_MOVE_ALL) ||
- (flags & MPOL_MF_MOVE && page_mapcount(page) == 1)) {
+ (flags & MPOL_MF_MOVE && page_mapcount(page) == 1 &&
+ !hugetlb_pmd_shared(pte))) {
if (isolate_hugetlb(page, qp->pagelist) &&
(flags & MPOL_MF_STRICT))
/*
The global irq_domain_mutex is held when mapping interrupts from
non-hierarchical domains but currently not when disposing them.
This specifically means that updates of the domain mapcount are racy
(the count is currently only used for statistics in debugfs).
Make sure to hold the global irq_domain_mutex also when disposing
mappings from non-hierarchical domains.
Fixes: 9dc6be3d4193 ("genirq/irqdomain: Add map counter")
Cc: stable(a)vger.kernel.org # 4.13
Tested-by: Hsin-Yi Wang <hsinyi(a)chromium.org>
Tested-by: Mark-PK Tsai <mark-pk.tsai(a)mediatek.com>
Signed-off-by: Johan Hovold <johan+linaro(a)kernel.org>
---
kernel/irq/irqdomain.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c
index 561689a3f050..981cd636275e 100644
--- a/kernel/irq/irqdomain.c
+++ b/kernel/irq/irqdomain.c
@@ -538,6 +538,9 @@ static void irq_domain_disassociate(struct irq_domain *domain, unsigned int irq)
return;
hwirq = irq_data->hwirq;
+
+ mutex_lock(&irq_domain_mutex);
+
irq_set_status_flags(irq, IRQ_NOREQUEST);
/* remove chip and handler */
@@ -557,6 +560,8 @@ static void irq_domain_disassociate(struct irq_domain *domain, unsigned int irq)
/* Clear reverse map for this hwirq */
irq_domain_clear_mapping(domain, hwirq);
+
+ mutex_unlock(&irq_domain_mutex);
}
static int irq_domain_associate_locked(struct irq_domain *domain, unsigned int virq,
--
2.39.1
Refactor __irq_domain_alloc_irqs() so that it can be called internally
while holding the irq_domain_mutex.
This will be used to fix a shared-interrupt mapping race, hence the
Fixes tag.
Fixes: b62b2cf5759b ("irqdomain: Fix handling of type settings for existing mappings")
Cc: stable(a)vger.kernel.org # 4.8
Tested-by: Hsin-Yi Wang <hsinyi(a)chromium.org>
Tested-by: Mark-PK Tsai <mark-pk.tsai(a)mediatek.com>
Signed-off-by: Johan Hovold <johan+linaro(a)kernel.org>
---
kernel/irq/irqdomain.c | 88 +++++++++++++++++++++++-------------------
1 file changed, 48 insertions(+), 40 deletions(-)
diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c
index 3d6a14efae62..7b57949bc79c 100644
--- a/kernel/irq/irqdomain.c
+++ b/kernel/irq/irqdomain.c
@@ -1441,40 +1441,12 @@ int irq_domain_alloc_irqs_hierarchy(struct irq_domain *domain,
return domain->ops->alloc(domain, irq_base, nr_irqs, arg);
}
-/**
- * __irq_domain_alloc_irqs - Allocate IRQs from domain
- * @domain: domain to allocate from
- * @irq_base: allocate specified IRQ number if irq_base >= 0
- * @nr_irqs: number of IRQs to allocate
- * @node: NUMA node id for memory allocation
- * @arg: domain specific argument
- * @realloc: IRQ descriptors have already been allocated if true
- * @affinity: Optional irq affinity mask for multiqueue devices
- *
- * Allocate IRQ numbers and initialized all data structures to support
- * hierarchy IRQ domains.
- * Parameter @realloc is mainly to support legacy IRQs.
- * Returns error code or allocated IRQ number
- *
- * The whole process to setup an IRQ has been split into two steps.
- * The first step, __irq_domain_alloc_irqs(), is to allocate IRQ
- * descriptor and required hardware resources. The second step,
- * irq_domain_activate_irq(), is to program the hardware with preallocated
- * resources. In this way, it's easier to rollback when failing to
- * allocate resources.
- */
-int __irq_domain_alloc_irqs(struct irq_domain *domain, int irq_base,
- unsigned int nr_irqs, int node, void *arg,
- bool realloc, const struct irq_affinity_desc *affinity)
+static int irq_domain_alloc_irqs_locked(struct irq_domain *domain, int irq_base,
+ unsigned int nr_irqs, int node, void *arg,
+ bool realloc, const struct irq_affinity_desc *affinity)
{
int i, ret, virq;
- if (domain == NULL) {
- domain = irq_default_domain;
- if (WARN(!domain, "domain is NULL; cannot allocate IRQ\n"))
- return -EINVAL;
- }
-
if (realloc && irq_base >= 0) {
virq = irq_base;
} else {
@@ -1493,24 +1465,18 @@ int __irq_domain_alloc_irqs(struct irq_domain *domain, int irq_base,
goto out_free_desc;
}
- mutex_lock(&irq_domain_mutex);
ret = irq_domain_alloc_irqs_hierarchy(domain, virq, nr_irqs, arg);
- if (ret < 0) {
- mutex_unlock(&irq_domain_mutex);
+ if (ret < 0)
goto out_free_irq_data;
- }
for (i = 0; i < nr_irqs; i++) {
ret = irq_domain_trim_hierarchy(virq + i);
- if (ret) {
- mutex_unlock(&irq_domain_mutex);
+ if (ret)
goto out_free_irq_data;
- }
}
-
+
for (i = 0; i < nr_irqs; i++)
irq_domain_insert_irq(virq + i);
- mutex_unlock(&irq_domain_mutex);
return virq;
@@ -1520,6 +1486,48 @@ int __irq_domain_alloc_irqs(struct irq_domain *domain, int irq_base,
irq_free_descs(virq, nr_irqs);
return ret;
}
+
+/**
+ * __irq_domain_alloc_irqs - Allocate IRQs from domain
+ * @domain: domain to allocate from
+ * @irq_base: allocate specified IRQ number if irq_base >= 0
+ * @nr_irqs: number of IRQs to allocate
+ * @node: NUMA node id for memory allocation
+ * @arg: domain specific argument
+ * @realloc: IRQ descriptors have already been allocated if true
+ * @affinity: Optional irq affinity mask for multiqueue devices
+ *
+ * Allocate IRQ numbers and initialized all data structures to support
+ * hierarchy IRQ domains.
+ * Parameter @realloc is mainly to support legacy IRQs.
+ * Returns error code or allocated IRQ number
+ *
+ * The whole process to setup an IRQ has been split into two steps.
+ * The first step, __irq_domain_alloc_irqs(), is to allocate IRQ
+ * descriptor and required hardware resources. The second step,
+ * irq_domain_activate_irq(), is to program the hardware with preallocated
+ * resources. In this way, it's easier to rollback when failing to
+ * allocate resources.
+ */
+int __irq_domain_alloc_irqs(struct irq_domain *domain, int irq_base,
+ unsigned int nr_irqs, int node, void *arg,
+ bool realloc, const struct irq_affinity_desc *affinity)
+{
+ int ret;
+
+ if (domain == NULL) {
+ domain = irq_default_domain;
+ if (WARN(!domain, "domain is NULL; cannot allocate IRQ\n"))
+ return -EINVAL;
+ }
+
+ mutex_lock(&irq_domain_mutex);
+ ret = irq_domain_alloc_irqs_locked(domain, irq_base, nr_irqs, node, arg,
+ realloc, affinity);
+ mutex_unlock(&irq_domain_mutex);
+
+ return ret;
+}
EXPORT_SYMBOL_GPL(__irq_domain_alloc_irqs);
/* The irq_data was moved, fix the revmap to refer to the new location */
--
2.39.1
In case a newly allocated IRQ ever ends up not having any associated
struct irq_data it would not even be possible to dispose the mapping.
Replace the bogus disposal with a WARN_ON().
This will also be used to fix a shared-interrupt mapping race, hence the
CC-stable tag.
Fixes: 1e2a7d78499e ("irqdomain: Don't set type when mapping an IRQ")
Cc: stable(a)vger.kernel.org # 4.8
Tested-by: Hsin-Yi Wang <hsinyi(a)chromium.org>
Tested-by: Mark-PK Tsai <mark-pk.tsai(a)mediatek.com>
Signed-off-by: Johan Hovold <johan+linaro(a)kernel.org>
---
kernel/irq/irqdomain.c | 7 +------
1 file changed, 1 insertion(+), 6 deletions(-)
diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c
index 981cd636275e..b4326c364ae7 100644
--- a/kernel/irq/irqdomain.c
+++ b/kernel/irq/irqdomain.c
@@ -847,13 +847,8 @@ unsigned int irq_create_fwspec_mapping(struct irq_fwspec *fwspec)
}
irq_data = irq_get_irq_data(virq);
- if (!irq_data) {
- if (irq_domain_is_hierarchy(domain))
- irq_domain_free_irqs(virq, 1);
- else
- irq_dispose_mapping(virq);
+ if (WARN_ON(!irq_data))
return 0;
- }
/* Store trigger type */
irqd_set_trigger_type(irq_data, type);
--
2.39.1
From: Marc Zyngier <maz(a)kernel.org>
Hierarchical domains created using irq_domain_create_hierarchy() are
currently added to the domain list before having been fully initialised.
This specifically means that a racing allocation request might fail to
allocate irq data for the inner domains of a hierarchy in case the
parent domain pointer has not yet been set up.
Note that this is not really any issue for irqchip drivers that are
registered early (e.g. via IRQCHIP_DECLARE() or IRQCHIP_ACPI_DECLARE())
but could potentially cause trouble with drivers that are registered
later (e.g. modular drivers using IRQCHIP_PLATFORM_DRIVER_BEGIN(),
gpiochip drivers, etc.).
Fixes: afb7da83b9f4 ("irqdomain: Introduce helper function irq_domain_add_hierarchy()")
Cc: stable(a)vger.kernel.org # 3.19
Signed-off-by: Marc Zyngier <maz(a)kernel.org>
[ johan: add commit message ]
Signed-off-by: Johan Hovold <johan+linaro(a)kernel.org>
---
kernel/irq/irqdomain.c | 62 +++++++++++++++++++++++++++++-------------
1 file changed, 43 insertions(+), 19 deletions(-)
diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c
index bfda4adc05c0..8e14805c5508 100644
--- a/kernel/irq/irqdomain.c
+++ b/kernel/irq/irqdomain.c
@@ -126,23 +126,12 @@ void irq_domain_free_fwnode(struct fwnode_handle *fwnode)
}
EXPORT_SYMBOL_GPL(irq_domain_free_fwnode);
-/**
- * __irq_domain_add() - Allocate a new irq_domain data structure
- * @fwnode: firmware node for the interrupt controller
- * @size: Size of linear map; 0 for radix mapping only
- * @hwirq_max: Maximum number of interrupts supported by controller
- * @direct_max: Maximum value of direct maps; Use ~0 for no limit; 0 for no
- * direct mapping
- * @ops: domain callbacks
- * @host_data: Controller private data pointer
- *
- * Allocates and initializes an irq_domain structure.
- * Returns pointer to IRQ domain, or NULL on failure.
- */
-struct irq_domain *__irq_domain_add(struct fwnode_handle *fwnode, unsigned int size,
- irq_hw_number_t hwirq_max, int direct_max,
- const struct irq_domain_ops *ops,
- void *host_data)
+static struct irq_domain *__irq_domain_create(struct fwnode_handle *fwnode,
+ unsigned int size,
+ irq_hw_number_t hwirq_max,
+ int direct_max,
+ const struct irq_domain_ops *ops,
+ void *host_data)
{
struct irqchip_fwid *fwid;
struct irq_domain *domain;
@@ -230,12 +219,44 @@ struct irq_domain *__irq_domain_add(struct fwnode_handle *fwnode, unsigned int s
irq_domain_check_hierarchy(domain);
+ return domain;
+}
+
+static void __irq_domain_publish(struct irq_domain *domain)
+{
mutex_lock(&irq_domain_mutex);
debugfs_add_domain_dir(domain);
list_add(&domain->link, &irq_domain_list);
mutex_unlock(&irq_domain_mutex);
pr_debug("Added domain %s\n", domain->name);
+}
+
+/**
+ * __irq_domain_add() - Allocate a new irq_domain data structure
+ * @fwnode: firmware node for the interrupt controller
+ * @size: Size of linear map; 0 for radix mapping only
+ * @hwirq_max: Maximum number of interrupts supported by controller
+ * @direct_max: Maximum value of direct maps; Use ~0 for no limit; 0 for no
+ * direct mapping
+ * @ops: domain callbacks
+ * @host_data: Controller private data pointer
+ *
+ * Allocates and initializes an irq_domain structure.
+ * Returns pointer to IRQ domain, or NULL on failure.
+ */
+struct irq_domain *__irq_domain_add(struct fwnode_handle *fwnode, unsigned int size,
+ irq_hw_number_t hwirq_max, int direct_max,
+ const struct irq_domain_ops *ops,
+ void *host_data)
+{
+ struct irq_domain *domain;
+
+ domain = __irq_domain_create(fwnode, size, hwirq_max, direct_max,
+ ops, host_data);
+ if (domain)
+ __irq_domain_publish(domain);
+
return domain;
}
EXPORT_SYMBOL_GPL(__irq_domain_add);
@@ -1138,12 +1159,15 @@ struct irq_domain *irq_domain_create_hierarchy(struct irq_domain *parent,
struct irq_domain *domain;
if (size)
- domain = irq_domain_create_linear(fwnode, size, ops, host_data);
+ domain = __irq_domain_create(fwnode, size, size, 0, ops, host_data);
else
- domain = irq_domain_create_tree(fwnode, ops, host_data);
+ domain = __irq_domain_create(fwnode, 0, ~0, 0, ops, host_data);
+
if (domain) {
domain->parent = parent;
domain->flags |= flags;
+
+ __irq_domain_publish(domain);
}
return domain;
--
2.39.1
Changes since v1:
- add some information
- tested on the wireless-2023-01-18 tag
- no real code change
When a connection is established without going through
NL80211_CMD_CONNECT, the SSID was never set in the wireless_dev struct.
Now we set it when an NL80211_CMD_AUTHENTICATE is issued.
This may need testing on some additional hardware (tested with iwlwifi
and an AX201, with iwd on the userspace side); I could not test things
like roaming and P2P.
Alternatives:
1. Do the same but during association rather than authentication.
2. Use ieee80211_bss_get_elem() in nl80211_send_iface; this would report
the right SSID to userspace, but it would not fix the root cause. This
was also the behaviour prior to 7b0a0e3c3a882, when the bug was
introduced.
This applies to v6.2-rc8 or wireless-2023-01-18.
The last Linux version known to be unaffected is 5.19, and the bug was
backported to the 5.19.y releases.
Reported-by: Yohan Prod'homme <kernel(a)zoddo.fr>
Fixes: 7b0a0e3c3a88260b6fcb017e49f198463aa62ed1
Cc: stable(a)vger.kernel.org
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216711
Signed-off-by: Marc Bornand <dev.mbornand(a)systemb.ch>
---
net/wireless/nl80211.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/net/wireless/nl80211.c b/net/wireless/nl80211.c
index 33a82ecab9d5..f1627ea542b9 100644
--- a/net/wireless/nl80211.c
+++ b/net/wireless/nl80211.c
@@ -10552,6 +10552,10 @@ static int nl80211_authenticate(struct sk_buff *skb, struct genl_info *info)
return -ENOENT;
wdev_lock(dev->ieee80211_ptr);
+
+ memcpy(dev->ieee80211_ptr->u.client.ssid, ssid, ssid_len);
+ dev->ieee80211_ptr->u.client.ssid_len = ssid_len;
+
err = cfg80211_mlme_auth(rdev, dev, &req);
wdev_unlock(dev->ieee80211_ptr);
@@ -11025,6 +11029,11 @@ static int nl80211_deauthenticate(struct sk_buff *skb, struct genl_info *info)
local_state_change = !!info->attrs[NL80211_ATTR_LOCAL_STATE_CHANGE];
wdev_lock(dev->ieee80211_ptr);
+
+ if (reason_code == WLAN_REASON_DEAUTH_LEAVING) {
+ dev->ieee80211_ptr->u.client.ssid_len = 0;
+ }
+
err = cfg80211_mlme_deauth(rdev, dev, bssid, ie, ie_len, reason_code,
local_state_change);
wdev_unlock(dev->ieee80211_ptr);
--
2.39.1
A number of Cezanne systems report IRQ1 as a wakeup source when no wakeup
actually occurred. This can cause problems for certain ACPI events. The
following fix went upstream to address it:
commit 8e60615e8932 ("platform/x86/amd: pmc: Disable IRQ1 wakeup for
RN/CZN")
It was reported that this fix actually helped here with older kernels too:
Link: https://gitlab.freedesktop.org/drm/amd/-/issues/1921#note_1770257
So backport this fix to 5.15.y as well. The backport is done by hand
because the driver has changed significantly. Backporting this also
requires the SMU version reading function, which was introduced in:
commit f6045de1f532 ("platform/x86: amd-pmc: Export Idlemask values based
on the APU")
So backport that part of that commit as well.
Mario Limonciello (1):
platform/x86/amd: pmc: Disable IRQ1 wakeup for RN/CZN
drivers/platform/x86/amd-pmc.c | 59 ++++++++++++++++++++++++++++++++++
1 file changed, 59 insertions(+)
--
2.34.1
From: Ville Syrjälä <ville.syrjala(a)linux.intel.com>
CONFIG_DRM_USE_DYNAMIC_DEBUG breaks debug prints for (at least modular)
drm drivers. The debug prints can be reinstated by manually frobbing
/sys/module/drm/parameters/debug after the fact, but at that point the
damage is done and all debugs from driver probe are lost. This makes
drivers totally undebuggable.
There's a more complete fix in progress [1], with further details, but
we need this fixed in stable kernels. Mark the feature as broken and
disable it by default, with hopes distros follow suit and disable it as
well.
[1] https://lore.kernel.org/r/20230125203743.564009-1-jim.cromie@gmail.com
Fixes: 84ec67288c10 ("drm_print: wrap drm_*_dbg in dyndbg descriptor factory macro")
Cc: Jim Cromie <jim.cromie(a)gmail.com>
Cc: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Cc: Maarten Lankhorst <maarten.lankhorst(a)linux.intel.com>
Cc: Maxime Ripard <mripard(a)kernel.org>
Cc: Thomas Zimmermann <tzimmermann(a)suse.de>
Cc: David Airlie <airlied(a)gmail.com>
Cc: Daniel Vetter <daniel(a)ffwll.ch>
Cc: dri-devel(a)lists.freedesktop.org
Cc: <stable(a)vger.kernel.org> # v6.1+
Signed-off-by: Ville Syrjälä <ville.syrjala(a)linux.intel.com>
Signed-off-by: Jani Nikula <jani.nikula(a)intel.com>
---
drivers/gpu/drm/Kconfig | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
index f42d4c6a19f2..dc0f94f02a82 100644
--- a/drivers/gpu/drm/Kconfig
+++ b/drivers/gpu/drm/Kconfig
@@ -52,7 +52,8 @@ config DRM_DEBUG_MM
config DRM_USE_DYNAMIC_DEBUG
bool "use dynamic debug to implement drm.debug"
- default y
+ default n
+ depends on BROKEN
depends on DRM
depends on DYNAMIC_DEBUG || DYNAMIC_DEBUG_CORE
depends on JUMP_LABEL
--
2.34.1
From: Leo Li <sunpeng.li(a)amd.com>
[Why]
drm_atomic_normalize_zpos() can return an error code when there's
modeset lock contention. This was being ignored.
[How]
Bail out of atomic check if normalize_zpos() returns an error.
Fixes: b261509952bc ("drm/amd/display: Fix double cursor on non-video RGB MPO")
Signed-off-by: Leo Li <sunpeng.li(a)amd.com>
Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov(a)gmail.com>
---
drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index c10982f841f98..cb2a57503000d 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -9889,7 +9889,11 @@ static int amdgpu_dm_atomic_check(struct drm_device *dev,
* `dcn10_can_pipe_disable_cursor`). By now, all modified planes are in
* atomic state, so call drm helper to normalize zpos.
*/
- drm_atomic_normalize_zpos(dev, state);
+ ret = drm_atomic_normalize_zpos(dev, state);
+ if (ret) {
+ drm_dbg(dev, "drm_atomic_normalize_zpos() failed\n");
+ goto fail;
+ }
/* Remove exiting planes if they are modified */
for_each_oldnew_plane_in_state_reverse(state, plane, old_plane_state, new_plane_state, i) {
--
2.39.1
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
251e8c5b1b1f ("drm/i915: Move fd_install after last use of fence")
544460c33821 ("drm/i915: Multi-BB execbuf")
5851387a422c ("drm/i915/guc: Implement no mid batch preemption for multi-lrc")
e5e32171a2cf ("drm/i915/guc: Connect UAPI to GuC multi-lrc interface")
d38a9294491d ("drm/i915/guc: Update debugfs for GuC multi-lrc")
bc955204919e ("drm/i915/guc: Insert submit fences between requests in parent-child relationship")
6b540bf6f143 ("drm/i915/guc: Implement multi-lrc submission")
99b47aaddfa9 ("drm/i915/guc: Implement parallel context pin / unpin functions")
c2aa552ff09d ("drm/i915/guc: Add multi-lrc context registration")
3897df4c0187 ("drm/i915/guc: Introduce context parent-child relationship")
4f3059dc2dbb ("drm/i915: Add logical engine mapping")
1a52faed3131 ("drm/i915/guc: Take GT PM ref when deregistering context")
0ea92ace8b95 ("drm/i915/guc: Move GuC guc_id allocation under submission state sub-struct")
0d8ee5ba8db4 ("drm/i915: Don't back up pinned LMEM context images and rings during suspend")
c56ce9565374 ("drm/i915 Implement LMEM backup and restore for suspend / resume")
0d9388635a22 ("drm/i915/ttm: Implement a function to copy the contents of two TTM-based objects")
48b096126954 ("drm/i915: Move __i915_gem_free_object to ttm_bo_destroy")
4f41ddc7c7ee ("drm/i915/guc: Add GuC kernel doc")
af5bc9f21e3a ("drm/i915/guc: Drop guc_active move everything into guc_state")
3cb3e3434b9f ("drm/i915/guc: Move fields protected by guc->contexts_lock into sub structure")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 251e8c5b1b1fadcc387a8e618c7437d330bdac3e Mon Sep 17 00:00:00 2001
From: Rob Clark <robdclark(a)chromium.org>
Date: Fri, 3 Feb 2023 08:49:20 -0800
Subject: [PATCH] drm/i915: Move fd_install after last use of fence
Because eb_composite_fence_create() drops the fence_array reference
after creation of the sync_file, only the sync_file holds a ref to the
fence. But fd_install() makes that reference visible to userspace, so
it must be the last thing we do with the fence.
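As a generic illustration of that ordering rule (a hedged sketch, not taken
from this commit; publish_fence() is a hypothetical helper):

#include <linux/dma-fence.h>
#include <linux/fcntl.h>
#include <linux/file.h>
#include <linux/sync_file.h>

static int publish_fence(struct dma_fence *fence)
{
        struct sync_file *sync_file;
        int fd;

        fd = get_unused_fd_flags(O_CLOEXEC);
        if (fd < 0)
                return fd;

        sync_file = sync_file_create(fence);
        if (!sync_file) {
                put_unused_fd(fd);
                return -ENOMEM;
        }

        /* ...any remaining use of @fence must happen before this point... */

        /*
         * When the sync_file ends up holding the only fence reference,
         * publishing the fd must be the very last step: once user space
         * can see it, a close() may drop that reference at any time.
         */
        fd_install(fd, sync_file->file);
        return fd;
}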
Signed-off-by: Rob Clark <robdclark(a)chromium.org>
Fixes: 00dae4d3d35d ("drm/i915: Implement SINGLE_TIMELINE with a syncobj (v4)")
Cc: <stable(a)vger.kernel.org> # v5.15+
[tursulin: Added stable tag.]
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin(a)intel.com>
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin(a)intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20230203164937.4035503-1-robd…
(cherry picked from commit 960dafa30455450d318756a9896a02727f2639e0)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi(a)intel.com>
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index f266b68cf012..0f2e056c02dd 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -3483,6 +3483,13 @@ i915_gem_do_execbuffer(struct drm_device *dev,
eb.composite_fence :
&eb.requests[0]->fence);
+ if (unlikely(eb.gem_context->syncobj)) {
+ drm_syncobj_replace_fence(eb.gem_context->syncobj,
+ eb.composite_fence ?
+ eb.composite_fence :
+ &eb.requests[0]->fence);
+ }
+
if (out_fence) {
if (err == 0) {
fd_install(out_fence_fd, out_fence->file);
@@ -3494,13 +3501,6 @@ i915_gem_do_execbuffer(struct drm_device *dev,
}
}
- if (unlikely(eb.gem_context->syncobj)) {
- drm_syncobj_replace_fence(eb.gem_context->syncobj,
- eb.composite_fence ?
- eb.composite_fence :
- &eb.requests[0]->fence);
- }
-
if (!out_fence && eb.composite_fence)
dma_fence_put(eb.composite_fence);
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
85e26dd5100a ("drm/client: fix circular reference counting issue")
444bbba708e8 ("drm/client: Prevent NULL dereference in drm_client_buffer_delete()")
27b2ae654370 ("drm/client: Switch drm_client_buffer_delete() to unlocked drm_gem_vunmap")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 85e26dd5100a182bf8448050427539c0a66ab793 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Christian=20K=C3=B6nig?= <christian.koenig(a)amd.com>
Date: Thu, 26 Jan 2023 10:24:26 +0100
Subject: [PATCH] drm/client: fix circular reference counting issue
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
We reference dumb buffers both by their handle as well as their
object. The problem is that when anybody iterates over the DRM
framebuffers and exports the underlying GEM objects through DMA-buf,
we run into a circular reference count situation.
The result is that the fbdev handling holds the GEM handle, preventing
the DMA-buf in the GEM object from being released. This DMA-buf in turn
holds a reference to the driver module, which on unload would release
the fbdev.
Break that loop by releasing the handle as soon as the DRM
framebuffer object is created. The DRM framebuffer and the DRM client
buffer structure still hold a reference to the underlying GEM object
preventing its destruction.
Signed-off-by: Christian König <christian.koenig(a)amd.com>
Fixes: c76f0f7cb546 ("drm: Begin an API for in-kernel clients")
Cc: <stable(a)vger.kernel.org>
Reviewed-by: Thomas Zimmermann <tzimmermann(a)suse.de>
Tested-by: Thomas Zimmermann <tzimmermann(a)suse.de>
Link: https://patchwork.freedesktop.org/patch/msgid/20230126102814.8722-1-christi…
diff --git a/drivers/gpu/drm/drm_client.c b/drivers/gpu/drm/drm_client.c
index fd67efe37c63..056ab9d5f313 100644
--- a/drivers/gpu/drm/drm_client.c
+++ b/drivers/gpu/drm/drm_client.c
@@ -233,21 +233,17 @@ void drm_client_dev_restore(struct drm_device *dev)
static void drm_client_buffer_delete(struct drm_client_buffer *buffer)
{
- struct drm_device *dev = buffer->client->dev;
-
if (buffer->gem) {
drm_gem_vunmap_unlocked(buffer->gem, &buffer->map);
drm_gem_object_put(buffer->gem);
}
- if (buffer->handle)
- drm_mode_destroy_dumb(dev, buffer->handle, buffer->client->file);
-
kfree(buffer);
}
static struct drm_client_buffer *
-drm_client_buffer_create(struct drm_client_dev *client, u32 width, u32 height, u32 format)
+drm_client_buffer_create(struct drm_client_dev *client, u32 width, u32 height,
+ u32 format, u32 *handle)
{
const struct drm_format_info *info = drm_format_info(format);
struct drm_mode_create_dumb dumb_args = { };
@@ -269,16 +265,15 @@ drm_client_buffer_create(struct drm_client_dev *client, u32 width, u32 height, u
if (ret)
goto err_delete;
- buffer->handle = dumb_args.handle;
- buffer->pitch = dumb_args.pitch;
-
obj = drm_gem_object_lookup(client->file, dumb_args.handle);
if (!obj) {
ret = -ENOENT;
goto err_delete;
}
+ buffer->pitch = dumb_args.pitch;
buffer->gem = obj;
+ *handle = dumb_args.handle;
return buffer;
@@ -365,7 +360,8 @@ static void drm_client_buffer_rmfb(struct drm_client_buffer *buffer)
}
static int drm_client_buffer_addfb(struct drm_client_buffer *buffer,
- u32 width, u32 height, u32 format)
+ u32 width, u32 height, u32 format,
+ u32 handle)
{
struct drm_client_dev *client = buffer->client;
struct drm_mode_fb_cmd fb_req = { };
@@ -377,7 +373,7 @@ static int drm_client_buffer_addfb(struct drm_client_buffer *buffer,
fb_req.depth = info->depth;
fb_req.width = width;
fb_req.height = height;
- fb_req.handle = buffer->handle;
+ fb_req.handle = handle;
fb_req.pitch = buffer->pitch;
ret = drm_mode_addfb(client->dev, &fb_req, client->file);
@@ -414,13 +410,24 @@ struct drm_client_buffer *
drm_client_framebuffer_create(struct drm_client_dev *client, u32 width, u32 height, u32 format)
{
struct drm_client_buffer *buffer;
+ u32 handle;
int ret;
- buffer = drm_client_buffer_create(client, width, height, format);
+ buffer = drm_client_buffer_create(client, width, height, format,
+ &handle);
if (IS_ERR(buffer))
return buffer;
- ret = drm_client_buffer_addfb(buffer, width, height, format);
+ ret = drm_client_buffer_addfb(buffer, width, height, format, handle);
+
+ /*
+ * The handle is only needed for creating the framebuffer, destroy it
+ * again to solve a circular dependency should anybody export the GEM
+ * object as DMA-buf. The framebuffer and our buffer structure are still
+ * holding references to the GEM object to prevent its destruction.
+ */
+ drm_mode_destroy_dumb(client->dev, handle, client->file);
+
if (ret) {
drm_client_buffer_delete(buffer);
return ERR_PTR(ret);
diff --git a/include/drm/drm_client.h b/include/drm/drm_client.h
index 4fc8018eddda..1220d185c776 100644
--- a/include/drm/drm_client.h
+++ b/include/drm/drm_client.h
@@ -126,11 +126,6 @@ struct drm_client_buffer {
*/
struct drm_client_dev *client;
- /**
- * @handle: Buffer handle
- */
- u32 handle;
-
/**
* @pitch: Buffer pitch
*/
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
85e26dd5100a ("drm/client: fix circular reference counting issue")
444bbba708e8 ("drm/client: Prevent NULL dereference in drm_client_buffer_delete()")
27b2ae654370 ("drm/client: Switch drm_client_buffer_delete() to unlocked drm_gem_vunmap")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 85e26dd5100a182bf8448050427539c0a66ab793 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Christian=20K=C3=B6nig?= <christian.koenig(a)amd.com>
Date: Thu, 26 Jan 2023 10:24:26 +0100
Subject: [PATCH] drm/client: fix circular reference counting issue
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
We reference dumb buffers both by their handle as well as their
object. The problem is that when anybody iterates over the DRM
framebuffers and exports the underlying GEM objects through DMA-buf,
we run into a circular reference count situation.
The result is that the fbdev handling holds the GEM handle, preventing
the DMA-buf in the GEM object from being released. This DMA-buf in turn
holds a reference to the driver module, which on unload would release
the fbdev.
Break that loop by releasing the handle as soon as the DRM
framebuffer object is created. The DRM framebuffer and the DRM client
buffer structure still hold a reference to the underlying GEM object
preventing its destruction.
Signed-off-by: Christian König <christian.koenig(a)amd.com>
Fixes: c76f0f7cb546 ("drm: Begin an API for in-kernel clients")
Cc: <stable(a)vger.kernel.org>
Reviewed-by: Thomas Zimmermann <tzimmermann(a)suse.de>
Tested-by: Thomas Zimmermann <tzimmermann(a)suse.de>
Link: https://patchwork.freedesktop.org/patch/msgid/20230126102814.8722-1-christi…
diff --git a/drivers/gpu/drm/drm_client.c b/drivers/gpu/drm/drm_client.c
index fd67efe37c63..056ab9d5f313 100644
--- a/drivers/gpu/drm/drm_client.c
+++ b/drivers/gpu/drm/drm_client.c
@@ -233,21 +233,17 @@ void drm_client_dev_restore(struct drm_device *dev)
static void drm_client_buffer_delete(struct drm_client_buffer *buffer)
{
- struct drm_device *dev = buffer->client->dev;
-
if (buffer->gem) {
drm_gem_vunmap_unlocked(buffer->gem, &buffer->map);
drm_gem_object_put(buffer->gem);
}
- if (buffer->handle)
- drm_mode_destroy_dumb(dev, buffer->handle, buffer->client->file);
-
kfree(buffer);
}
static struct drm_client_buffer *
-drm_client_buffer_create(struct drm_client_dev *client, u32 width, u32 height, u32 format)
+drm_client_buffer_create(struct drm_client_dev *client, u32 width, u32 height,
+ u32 format, u32 *handle)
{
const struct drm_format_info *info = drm_format_info(format);
struct drm_mode_create_dumb dumb_args = { };
@@ -269,16 +265,15 @@ drm_client_buffer_create(struct drm_client_dev *client, u32 width, u32 height, u
if (ret)
goto err_delete;
- buffer->handle = dumb_args.handle;
- buffer->pitch = dumb_args.pitch;
-
obj = drm_gem_object_lookup(client->file, dumb_args.handle);
if (!obj) {
ret = -ENOENT;
goto err_delete;
}
+ buffer->pitch = dumb_args.pitch;
buffer->gem = obj;
+ *handle = dumb_args.handle;
return buffer;
@@ -365,7 +360,8 @@ static void drm_client_buffer_rmfb(struct drm_client_buffer *buffer)
}
static int drm_client_buffer_addfb(struct drm_client_buffer *buffer,
- u32 width, u32 height, u32 format)
+ u32 width, u32 height, u32 format,
+ u32 handle)
{
struct drm_client_dev *client = buffer->client;
struct drm_mode_fb_cmd fb_req = { };
@@ -377,7 +373,7 @@ static int drm_client_buffer_addfb(struct drm_client_buffer *buffer,
fb_req.depth = info->depth;
fb_req.width = width;
fb_req.height = height;
- fb_req.handle = buffer->handle;
+ fb_req.handle = handle;
fb_req.pitch = buffer->pitch;
ret = drm_mode_addfb(client->dev, &fb_req, client->file);
@@ -414,13 +410,24 @@ struct drm_client_buffer *
drm_client_framebuffer_create(struct drm_client_dev *client, u32 width, u32 height, u32 format)
{
struct drm_client_buffer *buffer;
+ u32 handle;
int ret;
- buffer = drm_client_buffer_create(client, width, height, format);
+ buffer = drm_client_buffer_create(client, width, height, format,
+ &handle);
if (IS_ERR(buffer))
return buffer;
- ret = drm_client_buffer_addfb(buffer, width, height, format);
+ ret = drm_client_buffer_addfb(buffer, width, height, format, handle);
+
+ /*
+ * The handle is only needed for creating the framebuffer, destroy it
+ * again to solve a circular dependency should anybody export the GEM
+ * object as DMA-buf. The framebuffer and our buffer structure are still
+ * holding references to the GEM object to prevent its destruction.
+ */
+ drm_mode_destroy_dumb(client->dev, handle, client->file);
+
if (ret) {
drm_client_buffer_delete(buffer);
return ERR_PTR(ret);
diff --git a/include/drm/drm_client.h b/include/drm/drm_client.h
index 4fc8018eddda..1220d185c776 100644
--- a/include/drm/drm_client.h
+++ b/include/drm/drm_client.h
@@ -126,11 +126,6 @@ struct drm_client_buffer {
*/
struct drm_client_dev *client;
- /**
- * @handle: Buffer handle
- */
- u32 handle;
-
/**
* @pitch: Buffer pitch
*/
From: Xiubo Li <xiubli(a)redhat.com>
fallocate will try to clear the suid/sgid bits if an unprivileged user
changed the file.
There is no POSIX requirement that we clear the suid/sgid bits in the
fallocate code path, but this is the default behaviour for most
filesystems and the VFS layer, and it matches the write code path,
which already supports it.
We also need to update the timestamps, since fallocate changes the
file contents.
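A minimal userspace sketch (not part of the patch) of the expected
behaviour; the mount path is an assumption, and the bits are only cleared
when the caller is an unprivileged owner without CAP_FSETID:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
        const char *path = "/mnt/cephfs/suid-test";    /* assumed mount */
        struct stat st;
        int fd;

        fd = open(path, O_RDWR | O_CREAT, 0644);
        if (fd < 0)
                return 1;
        if (ftruncate(fd, 1 << 20) || fchmod(fd, 04755))
                return 1;

        /* With the fix this behaves like write(): suid/sgid are dropped. */
        if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, 4096))
                perror("fallocate");

        if (fstat(fd, &st) == 0)
                printf("mode after fallocate: %o (suid %s)\n",
                       st.st_mode & 07777,
                       (st.st_mode & S_ISUID) ? "still set" : "cleared");
        close(fd);
        return 0;
}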
Cc: stable(a)vger.kernel.org
URL: https://tracker.ceph.com/issues/58054
Signed-off-by: Xiubo Li <xiubli(a)redhat.com>
---
fs/ceph/file.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 903de296f0d3..dee3b445f415 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -2502,6 +2502,9 @@ static long ceph_fallocate(struct file *file, int mode,
loff_t endoff = 0;
loff_t size;
+ dout("%s %p %llx.%llx mode %x, offset %llu length %llu\n", __func__,
+ inode, ceph_vinop(inode), mode, offset, length);
+
if (mode != (FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
return -EOPNOTSUPP;
@@ -2539,6 +2542,10 @@ static long ceph_fallocate(struct file *file, int mode,
if (ret < 0)
goto unlock;
+ ret = file_modified(file);
+ if (ret)
+ goto put_caps;
+
filemap_invalidate_lock(inode->i_mapping);
ceph_fscache_invalidate(inode, false);
ceph_zero_pagecache_range(inode, offset, length);
@@ -2554,6 +2561,7 @@ static long ceph_fallocate(struct file *file, int mode,
}
filemap_invalidate_unlock(inode->i_mapping);
+put_caps:
ceph_put_cap_refs(ci, got);
unlock:
inode_unlock(inode);
--
2.31.1
Hi all
We found a warning from objtool:
arch/x86/entry/entry_64.o: warning: objtool: .entry.text+0x1d1:
unsupported intra-function call
and if we enable retpoline in config:
arch/x86/entry/entry_64.o: warning: objtool: .entry.text+0x1c1:
unsupported intra-function call
arch/x86/entry/entry_64.o: warning: objtool: If this is a retpoline,
please patch it in with alternatives and annotate it with
ANNOTATE_NOSPEC_ALTERNATIVE.
I found this issue was introduced by “x86/speculation: Change
FILL_RETURN_BUFFER to work with objtool” (commit 8afd1c7da2), which was
backported in v5.4.217.
Comparing with the upstream version (commit 089dd8e53): there is no
“ANNOTATE_INTRA_FUNCTION_CALL” in v5.4 because its dependency patch is
missing, so once “ANNOTATE_NOSPEC_ALTERNATIVE” is removed this issue
appears.
I tried to backport “ANNOTATE_INTRA_FUNCTION_CALL” and its dependency
patches to v5.4, but I ran into a CFA mismatch issue from objtool.
So, please help check this issue in the v5.4 LTS version.
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
519b7e13b5ae ("btrfs: lock the inode in shared mode before starting fiemap")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 519b7e13b5ae8dd38da1e52275705343be6bb508 Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Mon, 23 Jan 2023 16:54:46 +0000
Subject: [PATCH] btrfs: lock the inode in shared mode before starting fiemap
Currently fiemap does not take the inode's lock (VFS lock), it only locks
a file range in the inode's io tree. This however can lead to a deadlock
if we have a concurrent fsync on the file and fiemap code triggers a fault
when accessing the user space buffer with fiemap_fill_next_extent(). The
deadlock happens on the inode's i_mmap_lock semaphore, which is taken both
by fsync and btrfs_page_mkwrite(). This deadlock was recently reported by
syzbot and triggers a trace like the following:
task:syz-executor361 state:D stack:20264 pid:5668 ppid:5119 flags:0x00004004
Call Trace:
<TASK>
context_switch kernel/sched/core.c:5293 [inline]
__schedule+0x995/0xe20 kernel/sched/core.c:6606
schedule+0xcb/0x190 kernel/sched/core.c:6682
wait_on_state fs/btrfs/extent-io-tree.c:707 [inline]
wait_extent_bit+0x577/0x6f0 fs/btrfs/extent-io-tree.c:751
lock_extent+0x1c2/0x280 fs/btrfs/extent-io-tree.c:1742
find_lock_delalloc_range+0x4e6/0x9c0 fs/btrfs/extent_io.c:488
writepage_delalloc+0x1ef/0x540 fs/btrfs/extent_io.c:1863
__extent_writepage+0x736/0x14e0 fs/btrfs/extent_io.c:2174
extent_write_cache_pages+0x983/0x1220 fs/btrfs/extent_io.c:3091
extent_writepages+0x219/0x540 fs/btrfs/extent_io.c:3211
do_writepages+0x3c3/0x680 mm/page-writeback.c:2581
filemap_fdatawrite_wbc+0x11e/0x170 mm/filemap.c:388
__filemap_fdatawrite_range mm/filemap.c:421 [inline]
filemap_fdatawrite_range+0x175/0x200 mm/filemap.c:439
btrfs_fdatawrite_range fs/btrfs/file.c:3850 [inline]
start_ordered_ops fs/btrfs/file.c:1737 [inline]
btrfs_sync_file+0x4ff/0x1190 fs/btrfs/file.c:1839
generic_write_sync include/linux/fs.h:2885 [inline]
btrfs_do_write_iter+0xcd3/0x1280 fs/btrfs/file.c:1684
call_write_iter include/linux/fs.h:2189 [inline]
new_sync_write fs/read_write.c:491 [inline]
vfs_write+0x7dc/0xc50 fs/read_write.c:584
ksys_write+0x177/0x2a0 fs/read_write.c:637
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f7d4054e9b9
RSP: 002b:00007f7d404fa2f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 00007f7d405d87a0 RCX: 00007f7d4054e9b9
RDX: 0000000000000090 RSI: 0000000020000000 RDI: 0000000000000006
RBP: 00007f7d405a51d0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 61635f65646f6e69
R13: 65646f7475616f6e R14: 7261637369646f6e R15: 00007f7d405d87a8
</TASK>
INFO: task syz-executor361:5697 blocked for more than 145 seconds.
Not tainted 6.2.0-rc3-syzkaller-00376-g7c6984405241 #0
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:syz-executor361 state:D stack:21216 pid:5697 ppid:5119 flags:0x00004004
Call Trace:
<TASK>
context_switch kernel/sched/core.c:5293 [inline]
__schedule+0x995/0xe20 kernel/sched/core.c:6606
schedule+0xcb/0x190 kernel/sched/core.c:6682
rwsem_down_read_slowpath+0x5f9/0x930 kernel/locking/rwsem.c:1095
__down_read_common+0x54/0x2a0 kernel/locking/rwsem.c:1260
btrfs_page_mkwrite+0x417/0xc80 fs/btrfs/inode.c:8526
do_page_mkwrite+0x19e/0x5e0 mm/memory.c:2947
wp_page_shared+0x15e/0x380 mm/memory.c:3295
handle_pte_fault mm/memory.c:4949 [inline]
__handle_mm_fault mm/memory.c:5073 [inline]
handle_mm_fault+0x1b79/0x26b0 mm/memory.c:5219
do_user_addr_fault+0x69b/0xcb0 arch/x86/mm/fault.c:1428
handle_page_fault arch/x86/mm/fault.c:1519 [inline]
exc_page_fault+0x7a/0x110 arch/x86/mm/fault.c:1575
asm_exc_page_fault+0x22/0x30 arch/x86/include/asm/idtentry.h:570
RIP: 0010:copy_user_short_string+0xd/0x40 arch/x86/lib/copy_user_64.S:233
Code: 74 0a 89 (...)
RSP: 0018:ffffc9000570f330 EFLAGS: 00050202
RAX: ffffffff843e6601 RBX: 00007fffffffefc8 RCX: 0000000000000007
RDX: 0000000000000000 RSI: ffffc9000570f3e0 RDI: 0000000020000120
RBP: ffffc9000570f490 R08: 0000000000000000 R09: fffff52000ae1e83
R10: fffff52000ae1e83 R11: 1ffff92000ae1e7c R12: 0000000000000038
R13: ffffc9000570f3e0 R14: 0000000020000120 R15: ffffc9000570f3e0
copy_user_generic arch/x86/include/asm/uaccess_64.h:37 [inline]
raw_copy_to_user arch/x86/include/asm/uaccess_64.h:58 [inline]
_copy_to_user+0xe9/0x130 lib/usercopy.c:34
copy_to_user include/linux/uaccess.h:169 [inline]
fiemap_fill_next_extent+0x22e/0x410 fs/ioctl.c:144
emit_fiemap_extent+0x22d/0x3c0 fs/btrfs/extent_io.c:3458
fiemap_process_hole+0xa00/0xad0 fs/btrfs/extent_io.c:3716
extent_fiemap+0xe27/0x2100 fs/btrfs/extent_io.c:3922
btrfs_fiemap+0x172/0x1e0 fs/btrfs/inode.c:8209
ioctl_fiemap fs/ioctl.c:219 [inline]
do_vfs_ioctl+0x185b/0x2980 fs/ioctl.c:810
__do_sys_ioctl fs/ioctl.c:868 [inline]
__se_sys_ioctl+0x83/0x170 fs/ioctl.c:856
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f7d4054e9b9
RSP: 002b:00007f7d390d92f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007f7d405d87b0 RCX: 00007f7d4054e9b9
RDX: 0000000020000100 RSI: 00000000c020660b RDI: 0000000000000005
RBP: 00007f7d405a51d0 R08: 00007f7d390d9700 R09: 0000000000000000
R10: 00007f7d390d9700 R11: 0000000000000246 R12: 61635f65646f6e69
R13: 65646f7475616f6e R14: 7261637369646f6e R15: 00007f7d405d87b8
</TASK>
What happens is the following:
1) Task A is doing an fsync, enters btrfs_sync_file() and flushes delalloc
before locking the inode and the i_mmap_lock semaphore, that is, before
calling btrfs_inode_lock();
2) After task A flushes delalloc and before it calls btrfs_inode_lock(),
another task dirties a page;
3) Task B starts a fiemap without FIEMAP_FLAG_SYNC, so the page dirtied
at step 2 remains dirty and unflushed. Then when it enters
extent_fiemap() and it locks a file range that includes the range of
the page dirtied in step 2;
4) Task A calls btrfs_inode_lock() and locks the inode (VFS lock) and the
inode's i_mmap_lock semaphore in write mode. Then it tries to flush
delalloc by calling start_ordered_ops(), which will block, at
find_lock_delalloc_range(), when trying to lock the range of the page
dirtied at step 2, since this range was locked by the fiemap task (at
step 3);
5) Task B generates a page fault when accessing the user space fiemap
buffer with a call to fiemap_fill_next_extent().
The fault handler needs to call btrfs_page_mkwrite() for some other
page of our inode, and there we deadlock when trying to lock the
inode's i_mmap_lock semaphore in read mode, since the fsync task locked
it in write mode (step 4) and the fsync task can not progress because
it's waiting to lock a file range that is currently locked by us (the
fiemap task, step 3).
Fix this by taking the inode's lock (VFS lock) in shared mode when
entering fiemap. This effectively serializes fiemap with fsync (except the
most expensive part of fsync, the log sync), preventing this deadlock.
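For reference, a hedged userspace sketch (not part of the patch) of the
FS_IOC_FIEMAP call that now runs under the shared inode lock and therefore
serializes with a concurrent fsync() on the same file:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fiemap.h>
#include <linux/fs.h>

int main(int argc, char **argv)
{
        struct fiemap *fm;
        int fd;

        if (argc < 2)
                return 1;
        fd = open(argv[1], O_RDONLY);
        if (fd < 0)
                return 1;

        /* Header plus room for 32 extents. */
        fm = calloc(1, sizeof(*fm) + 32 * sizeof(struct fiemap_extent));
        if (!fm)
                return 1;
        fm->fm_length = ~0ULL;          /* map the whole file */
        fm->fm_extent_count = 32;

        if (ioctl(fd, FS_IOC_FIEMAP, fm) == 0)
                printf("%u extents mapped\n", fm->fm_mapped_extents);
        else
                perror("FS_IOC_FIEMAP");

        free(fm);
        close(fd);
        return 0;
}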
Reported-by: syzbot+cc35f55c41e34c30dcb5(a)syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/00000000000032dc7305f2a66f46@google.com/
CC: stable(a)vger.kernel.org # 6.1+
Reviewed-by: Josef Bacik <josef(a)toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 9bd32daa9b9a..3bbf8703db2a 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3826,6 +3826,7 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
lockend = round_up(start + len, inode->root->fs_info->sectorsize);
prev_extent_end = lockstart;
+ btrfs_inode_lock(inode, BTRFS_ILOCK_SHARED);
lock_extent(&inode->io_tree, lockstart, lockend, &cached_state);
ret = fiemap_find_last_extent_offset(inode, path, &last_extent_end);
@@ -4019,6 +4020,7 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
out_unlock:
unlock_extent(&inode->io_tree, lockstart, lockend, &cached_state);
+ btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);
out:
free_extent_state(delalloc_cached_state);
btrfs_free_backref_share_ctx(backref_ctx);
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
7ece674cd946 ("Revert "drm/amd/display: disable S/G display on DCN 3.1.4"")
077e9659581a ("drm/amd/display: disable S/G display on DCN 3.1.2/3")
a52287d66dfa ("drm/amd/display: disable S/G display on DCN 3.1.4")
e78cc6a4c748 ("drm/amd/display: disable S/G display on DCN 3.1.5")
fe6872adb05e ("drm/amd/display: Add DCN314 display SG Support")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 7ece674cd9468ce740494f6108c39831cfc7eb4e Mon Sep 17 00:00:00 2001
From: Alex Deucher <alexander.deucher(a)amd.com>
Date: Tue, 31 Jan 2023 13:10:55 -0500
Subject: [PATCH] Revert "drm/amd/display: disable S/G display on DCN 3.1.4"
This reverts commit 9aa15370819294beb7eb67c9dcbf654d79ff8790.
This is fixed now so we can re-enable S/G display on DCN
3.1.4.
Reviewed-by: Yifan Zhang <yifan1.zhang(a)amd.com>
Signed-off-by: Alex Deucher <alexander.deucher(a)amd.com>
Cc: stable(a)vger.kernel.org # 6.1.x
diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 7f6e27561899..78452856b2a3 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -1514,6 +1514,7 @@ static int amdgpu_dm_init(struct amdgpu_device *adev)
init_data.flags.gpu_vm_support = true;
break;
case IP_VERSION(3, 0, 1):
+ case IP_VERSION(3, 1, 4):
case IP_VERSION(3, 1, 6):
init_data.flags.gpu_vm_support = true;
break;
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
ad2171009d96 ("mptcp: fix locking for in-kernel listener creation")
976d302fb616 ("mptcp: deduplicate error paths on endpoint creation")
3eb9a6b6503c ("mptcp: account memory allocation in mptcp_nl_cmd_add_addr() to user")
d045b9eb95a9 ("mptcp: introduce implicit endpoints")
33397b83eee6 ("selftests: mptcp: add backup with port testcase")
09f12c3ab7a5 ("mptcp: allow to use port and non-signal in set_flags")
6a0653b96f5d ("selftests: mptcp: add fullmesh setting tests")
327b9a94e2a8 ("selftests: mptcp: more stable join tests-cases")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From ad2171009d968104ccda9dc517f5a3ba891515db Mon Sep 17 00:00:00 2001
From: Paolo Abeni <pabeni(a)redhat.com>
Date: Tue, 7 Feb 2023 14:04:15 +0100
Subject: [PATCH] mptcp: fix locking for in-kernel listener creation
For consistency, in mptcp_pm_nl_create_listen_socket(), we need to
call the __mptcp_nmpc_socket() under the msk socket lock.
Note that as a side effect, mptcp_subflow_create_socket() needs a
'nested' lockdep annotation, as it will acquire the subflow (kernel)
socket lock under the in-kernel listener msk socket lock.
The current lack of locking is almost harmless, because the relevant
socket is not exposed to user space, but in the future we will add
more complexity to the mentioned helper, so let's play safe.
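A generic sketch of that nesting pattern (hedged; not mptcp code, and the
foo_ prefixed name is hypothetical):

#include <linux/lockdep.h>
#include <net/sock.h>

static void foo_setup_child_locked(struct sock *parent, struct sock *child)
{
        lock_sock(parent);

        /*
         * Taking a second socket lock while the first is held needs a
         * nested annotation, otherwise lockdep reports a false positive
         * about recursive locking of sk_lock.
         */
        lock_sock_nested(child, SINGLE_DEPTH_NESTING);

        /* ...initialise the child socket under both locks... */

        release_sock(child);
        release_sock(parent);
}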
Fixes: 1729cf186d8a ("mptcp: create the listening socket for new port")
Cc: stable(a)vger.kernel.org
Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts(a)tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts(a)tessares.net>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
diff --git a/net/mptcp/pm_netlink.c b/net/mptcp/pm_netlink.c
index 2ea7eae43bdb..10fe9771a852 100644
--- a/net/mptcp/pm_netlink.c
+++ b/net/mptcp/pm_netlink.c
@@ -998,8 +998,8 @@ static int mptcp_pm_nl_create_listen_socket(struct sock *sk,
{
int addrlen = sizeof(struct sockaddr_in);
struct sockaddr_storage addr;
- struct mptcp_sock *msk;
struct socket *ssock;
+ struct sock *newsk;
int backlog = 1024;
int err;
@@ -1008,11 +1008,13 @@ static int mptcp_pm_nl_create_listen_socket(struct sock *sk,
if (err)
return err;
- msk = mptcp_sk(entry->lsk->sk);
- if (!msk)
+ newsk = entry->lsk->sk;
+ if (!newsk)
return -EINVAL;
- ssock = __mptcp_nmpc_socket(msk);
+ lock_sock(newsk);
+ ssock = __mptcp_nmpc_socket(mptcp_sk(newsk));
+ release_sock(newsk);
if (!ssock)
return -EINVAL;
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index ec54413fb31f..a3e5026bee5b 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -1679,7 +1679,7 @@ int mptcp_subflow_create_socket(struct sock *sk, unsigned short family,
if (err)
return err;
- lock_sock(sf->sk);
+ lock_sock_nested(sf->sk, SINGLE_DEPTH_NESTING);
/* the newly created socket has to be in the same cgroup as its parent */
mptcp_attach_cgroup(sk, sf->sk);
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
ad2171009d96 ("mptcp: fix locking for in-kernel listener creation")
976d302fb616 ("mptcp: deduplicate error paths on endpoint creation")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From ad2171009d968104ccda9dc517f5a3ba891515db Mon Sep 17 00:00:00 2001
From: Paolo Abeni <pabeni(a)redhat.com>
Date: Tue, 7 Feb 2023 14:04:15 +0100
Subject: [PATCH] mptcp: fix locking for in-kernel listener creation
For consistency, in mptcp_pm_nl_create_listen_socket(), we need to
call the __mptcp_nmpc_socket() under the msk socket lock.
Note that as a side effect, mptcp_subflow_create_socket() needs a
'nested' lockdep annotation, as it will acquire the subflow (kernel)
socket lock under the in-kernel listener msk socket lock.
The current lack of locking is almost harmless, because the relevant
socket is not exposed to user space, but in the future we will add
more complexity to the mentioned helper, so let's play safe.
Fixes: 1729cf186d8a ("mptcp: create the listening socket for new port")
Cc: stable(a)vger.kernel.org
Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts(a)tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts(a)tessares.net>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
diff --git a/net/mptcp/pm_netlink.c b/net/mptcp/pm_netlink.c
index 2ea7eae43bdb..10fe9771a852 100644
--- a/net/mptcp/pm_netlink.c
+++ b/net/mptcp/pm_netlink.c
@@ -998,8 +998,8 @@ static int mptcp_pm_nl_create_listen_socket(struct sock *sk,
{
int addrlen = sizeof(struct sockaddr_in);
struct sockaddr_storage addr;
- struct mptcp_sock *msk;
struct socket *ssock;
+ struct sock *newsk;
int backlog = 1024;
int err;
@@ -1008,11 +1008,13 @@ static int mptcp_pm_nl_create_listen_socket(struct sock *sk,
if (err)
return err;
- msk = mptcp_sk(entry->lsk->sk);
- if (!msk)
+ newsk = entry->lsk->sk;
+ if (!newsk)
return -EINVAL;
- ssock = __mptcp_nmpc_socket(msk);
+ lock_sock(newsk);
+ ssock = __mptcp_nmpc_socket(mptcp_sk(newsk));
+ release_sock(newsk);
if (!ssock)
return -EINVAL;
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index ec54413fb31f..a3e5026bee5b 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -1679,7 +1679,7 @@ int mptcp_subflow_create_socket(struct sock *sk, unsigned short family,
if (err)
return err;
- lock_sock(sf->sk);
+ lock_sock_nested(sf->sk, SINGLE_DEPTH_NESTING);
/* the newly created socket has to be in the same cgroup as its parent */
mptcp_attach_cgroup(sk, sf->sk);
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
21e43569685d ("mptcp: fix locking for setsockopt corner-case")
d3d429047cc6 ("mptcp: sockopt: make 'tcp_fastopen_connect' generic")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 21e43569685de4ad773fb060c11a15f3fd5e7ac4 Mon Sep 17 00:00:00 2001
From: Paolo Abeni <pabeni(a)redhat.com>
Date: Tue, 7 Feb 2023 14:04:14 +0100
Subject: [PATCH] mptcp: fix locking for setsockopt corner-case
We need to call __mptcp_nmpc_socket(), and perform the later subflow
socket access, under the msk socket lock, or e.g. a racing connect()
could change the socket status under the hood, with unexpected results.
Fixes: 54635bd04701 ("mptcp: add TCP_FASTOPEN_CONNECT socket option")
Cc: stable(a)vger.kernel.org
Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts(a)tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts(a)tessares.net>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
diff --git a/net/mptcp/sockopt.c b/net/mptcp/sockopt.c
index d4b1e6ec1b36..7f2c3727ab23 100644
--- a/net/mptcp/sockopt.c
+++ b/net/mptcp/sockopt.c
@@ -760,14 +760,21 @@ static int mptcp_setsockopt_v4(struct mptcp_sock *msk, int optname,
static int mptcp_setsockopt_first_sf_only(struct mptcp_sock *msk, int level, int optname,
sockptr_t optval, unsigned int optlen)
{
+ struct sock *sk = (struct sock *)msk;
struct socket *sock;
+ int ret = -EINVAL;
/* Limit to first subflow, before the connection establishment */
+ lock_sock(sk);
sock = __mptcp_nmpc_socket(msk);
if (!sock)
- return -EINVAL;
+ goto unlock;
- return tcp_setsockopt(sock->sk, level, optname, optval, optlen);
+ ret = tcp_setsockopt(sock->sk, level, optname, optval, optlen);
+
+unlock:
+ release_sock(sk);
+ return ret;
}
static int mptcp_setsockopt_sol_tcp(struct mptcp_sock *msk, int optname,
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
d4e85922e3e7 ("mptcp: do not wait for bare sockets' timeout")
76a13b315709 ("mptcp: invoke MP_FAIL response when needed")
d9fb797046c5 ("mptcp: Do not traverse the subflow connection list without lock")
d42f9e4e2384 ("mptcp: Check for orphaned subflow before handling MP_FAIL timer")
d7e6f5836038 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From d4e85922e3e7ef2071f91f65e61629b60f3a9cf4 Mon Sep 17 00:00:00 2001
From: Paolo Abeni <pabeni(a)redhat.com>
Date: Tue, 7 Feb 2023 14:04:13 +0100
Subject: [PATCH] mptcp: do not wait for bare sockets' timeout
If the peer closes all the existing subflows for a given
mptcp socket and later the application closes it, the current
implementation lets it survive until the timewait timeout expires.
While the above is allowed by the protocol specification, it
consumes resources for almost no reason and additionally
causes sporadic self-test failures.
Let's move the mptcp socket to the TCP_CLOSE state when there are
no alive subflows at close time, so that the allocated resources
will be freed immediately.
Fixes: e16163b6e2b7 ("mptcp: refactor shutdown and close")
Cc: stable(a)vger.kernel.org
Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts(a)tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts(a)tessares.net>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 8cd6cc67c2c5..bc6c1f62a690 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -2897,6 +2897,7 @@ bool __mptcp_close(struct sock *sk, long timeout)
struct mptcp_subflow_context *subflow;
struct mptcp_sock *msk = mptcp_sk(sk);
bool do_cancel_work = false;
+ int subflows_alive = 0;
sk->sk_shutdown = SHUTDOWN_MASK;
@@ -2922,6 +2923,8 @@ bool __mptcp_close(struct sock *sk, long timeout)
struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
bool slow = lock_sock_fast_nested(ssk);
+ subflows_alive += ssk->sk_state != TCP_CLOSE;
+
/* since the close timeout takes precedence on the fail one,
* cancel the latter
*/
@@ -2937,6 +2940,12 @@ bool __mptcp_close(struct sock *sk, long timeout)
}
sock_orphan(sk);
+ /* all the subflows are closed, only timeout can change the msk
+ * state, let's not keep resources busy for no reasons
+ */
+ if (subflows_alive == 0)
+ inet_sk_state_store(sk, TCP_CLOSE);
+
sock_hold(sk);
pr_debug("msk=%p state=%d", sk, sk->sk_state);
if (msk->token)
--
This bug is marked as fixed by commit:
net: core: netlink: add helper refcount dec and lock function
net: sched: add helper function to take reference to Qdisc
net: sched: extend Qdisc with rcu
net: sched: rename qdisc_destroy() to qdisc_put()
net: sched: use Qdisc rcu API instead of relying on rtnl lock
But I can't find it in the tested trees[1] for more than 90 days.
Is it a correct commit? Please update it by replying:
#syz fix: exact-commit-title
Until then the bug is still considered open and new crashes with
the same signature are ignored.
Kernel: Linux 4.19
Dashboard link: https://syzkaller.appspot.com/bug?extid=5f229e48cccc804062c0
---
[1] I expect the commit to be present in:
1. linux-4.19.y branch of
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git