Hi, this is your Linux kernel regression tracker speaking. Top-posting
for once, to make this easily accessible to everyone.
The issue below, which started to happen between v5.10.80..v5.10.90, was
recently reported to bugzilla, but the reporter didn't even get a single
reply afaics. Could somebody maybe take a look? Bisection is likely not
easy in this case, so a few tips to narrow down the area to search might
help a lot here.
https://bugzilla.kernel.org/show_bug.cgi?id=215562
Ciao, Thorsten
On 03.02.22 16:03, Thorsten Leemhuis wrote:
> Hi, this is your Linux kernel regression tracker speaking.
>
> There is a regression in bugzilla.kernel.org I'd like to add to the
> tracking:
>
> #regzbot introduced: v5.10.80..v5.10.90
> #regzbot from: Patrick Schaaf <kernelorg(a)bof.de>
> #regzbot title: mm: unable to handle page fault in cache_reap
> #regzbot link: https://bugzilla.kernel.org/show_bug.cgi?id=215562
>
> Quote:
>
>> We've been running self-built 5.10.x kernels on DL380 hosts for quite a while, also inside the VMs there.
>>
>> With I think 5.10.90 three weeks or so back, we experienced a lockup upon umounting a larger, dirty filesystem on the host side, unfortunately without capturing a backtrace back then.
>>
>> Today something that felt similar happened again, on a machine running 5.10.93 both on the host and inside its 10 various VMs.
>>
>> Problem showed shortly (minutes) after shutting down one of the VMs (few hundred GB memory / dataset, VM shutdown was complete already; direct I/O), and then some LVM volume renames, a quick short outside ext4 mount followed by an umount (8 GB volume, probably a few hundred megabyte only to write). Actually monitoring suggests that disk writes were already done about a minute before the onset.
>>
>> What we then experienced was the following BUG, followed by one CPU after the other saying goodbye with soft lockup messages over the course of a few minutes; meanwhile the box could no longer be pinged, logged in to on the console, etc. We hard power-cycled it and it recovered fully.
>>
>> here's the BUG that was logged; if it is useful for someone to see the followup soft lockup messages, tell me + I'll add them.
>>
>> Feb 02 15:22:27 kvm3j kernel: BUG: unable to handle page fault for address: ffffebde00000008
>> Feb 02 15:22:27 kvm3j kernel: #PF: supervisor read access in kernel mode
>> Feb 02 15:22:27 kvm3j kernel: #PF: error_code(0x0000) - not-present page
>> Feb 02 15:22:27 kvm3j kernel: Oops: 0000 [#1] SMP PTI
>> Feb 02 15:22:27 kvm3j kernel: CPU: 7 PID: 39833 Comm: kworker/7:0 Tainted: G I 5.10.93-kvm #1
>> Feb 02 15:22:27 kvm3j kernel: Hardware name: HP ProLiant DL380p Gen8, BIOS P70 12/20/2013
>> Feb 02 15:22:27 kvm3j kernel: Workqueue: events cache_reap
>> Feb 02 15:22:27 kvm3j kernel: RIP: 0010:free_block.constprop.0+0xc0/0x1f0
>> Feb 02 15:22:27 kvm3j kernel: Code: 4c 8b 16 4c 89 d0 48 01 e8 0f 82 32 01 00 00 4c 89 f2 48 bb 00 00 00 00 00 ea ff ff 48 01 d0 48 c1 e8 0c 48 c1 e0 06 48 01 d8 <48> 8b 50 08 48 8d 4a ff 83 e2 01 48 >
>> Feb 02 15:22:27 kvm3j kernel: RSP: 0018:ffffc9000252bdc8 EFLAGS: 00010086
>> Feb 02 15:22:27 kvm3j kernel: RAX: ffffebde00000000 RBX: ffffea0000000000 RCX: ffff888889141b00
>> Feb 02 15:22:27 kvm3j kernel: RDX: 0000777f80000000 RSI: ffff893d3edf3400 RDI: ffff8881000403c0
>> Feb 02 15:22:27 kvm3j kernel: RBP: 0000000080000000 R08: ffff888100041300 R09: 0000000000000003
>> Feb 02 15:22:27 kvm3j kernel: R10: 0000000000000000 R11: ffff888100041308 R12: dead000000000122
>> Feb 02 15:22:27 kvm3j kernel: R13: dead000000000100 R14: 0000777f80000000 R15: ffff893ed8780d60
>> Feb 02 15:22:27 kvm3j kernel: FS: 0000000000000000(0000) GS:ffff893d3edc0000(0000) knlGS:0000000000000000
>> Feb 02 15:22:27 kvm3j kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> Feb 02 15:22:27 kvm3j kernel: CR2: ffffebde00000008 CR3: 000000048c4aa002 CR4: 00000000001726e0
>> Feb 02 15:22:27 kvm3j kernel: Call Trace:
>> Feb 02 15:22:27 kvm3j kernel: drain_array_locked.constprop.0+0x2e/0x80
>> Feb 02 15:22:27 kvm3j kernel: drain_array.constprop.0+0x54/0x70
>> Feb 02 15:22:27 kvm3j kernel: cache_reap+0x6c/0x100
>> Feb 02 15:22:27 kvm3j kernel: process_one_work+0x1cf/0x360
>> Feb 02 15:22:27 kvm3j kernel: worker_thread+0x45/0x3a0
>> Feb 02 15:22:27 kvm3j kernel: ? process_one_work+0x360/0x360
>> Feb 02 15:22:27 kvm3j kernel: kthread+0x116/0x130
>> Feb 02 15:22:27 kvm3j kernel: ? kthread_create_worker_on_cpu+0x40/0x40
>> Feb 02 15:22:27 kvm3j kernel: ret_from_fork+0x22/0x30
>> Feb 02 15:22:27 kvm3j kernel: Modules linked in: hpilo
>> Feb 02 15:22:27 kvm3j kernel: CR2: ffffebde00000008
>> Feb 02 15:22:27 kvm3j kernel: ---[ end trace ded3153d86a92898 ]---
>> Feb 02 15:22:27 kvm3j kernel: RIP: 0010:free_block.constprop.0+0xc0/0x1f0
>> Feb 02 15:22:27 kvm3j kernel: Code: 4c 8b 16 4c 89 d0 48 01 e8 0f 82 32 01 00 00 4c 89 f2 48 bb 00 00 00 00 00 ea ff ff 48 01 d0 48 c1 e8 0c 48 c1 e0 06 48 01 d8 <48> 8b 50 08 48 8d 4a ff 83 e2 01 48 >
>> Feb 02 15:22:27 kvm3j kernel: RSP: 0018:ffffc9000252bdc8 EFLAGS: 00010086
>> Feb 02 15:22:27 kvm3j kernel: RAX: ffffebde00000000 RBX: ffffea0000000000 RCX: ffff888889141b00
>> Feb 02 15:22:27 kvm3j kernel: RDX: 0000777f80000000 RSI: ffff893d3edf3400 RDI: ffff8881000403c0
>> Feb 02 15:22:27 kvm3j kernel: RBP: 0000000080000000 R08: ffff888100041300 R09: 0000000000000003
>> Feb 02 15:22:27 kvm3j kernel: R10: 0000000000000000 R11: ffff888100041308 R12: dead000000000122
>> Feb 02 15:22:27 kvm3j kernel: R13: dead000000000100 R14: 0000777f80000000 R15: ffff893ed8780d60
>> Feb 02 15:22:27 kvm3j kernel: FS: 0000000000000000(0000) GS:ffff893d3edc0000(0000) knlGS:0000000000000000
>> Feb 02 15:22:27 kvm3j kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> Feb 02 15:22:27 kvm3j kernel: CR2: ffffebde00000008 CR3: 000000048c4aa002 CR4: 00000000001726e0
>
> Ciao, Thorsten (wearing his 'Linux kernel regression tracker' hat)
>
> P.S.: As a Linux kernel regression tracker I'm getting a lot of reports
> on my table. I can only look briefly into most of them. Unfortunately
> therefore I sometimes will get things wrong or miss something important.
> I hope that's not the case here; if you think it is, don't hesitate to
> tell me about it in a public reply, that's in everyone's interest.
>
> BTW, I have no personal interest in this issue, which is tracked using
> regzbot, my Linux kernel regression tracking bot
> (https://linux-regtracking.leemhuis.info/regzbot/). I'm only posting
> this mail to get things rolling again and hence don't need to be CCed on
> all further activities wrt this regression.
>
> ---
> Additional information about regzbot:
>
> If you want to know more about regzbot, check out its web-interface, the
> getting started guide, and/or the reference documentation:
>
> https://linux-regtracking.leemhuis.info/regzbot/
> https://gitlab.com/knurd42/regzbot/-/blob/main/docs/getting_started.md
> https://gitlab.com/knurd42/regzbot/-/blob/main/docs/reference.md
>
> The last two documents will explain how you can interact with regzbot
> yourself if you want to.
>
> Hint for reporters: when reporting a regression it's in your interest to
> tell #regzbot about it in the report, as that will ensure the regression
> gets on the radar of regzbot and the regression tracker. That's in your
> interest, as they will make sure the report won't fall through the
> cracks unnoticed.
>
> Hint for developers: you normally don't need to care about regzbot once
> it's involved. Fix the issue as you normally would, just remember to
> include a 'Link:' tag to the report in the commit message, as explained
> in Documentation/process/submitting-patches.rst
> That aspect was recently made more explicit in commit 1f57bd42b77c:
> https://git.kernel.org/linus/1f57bd42b77c
This is the start of the stable review cycle for the 4.19.230 release.
There are 49 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Wed, 16 Feb 2022 09:24:36 +0000.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.19.230-r…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-4.19.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 4.19.230-rc1
Song Liu <song(a)kernel.org>
perf: Fix list corruption in perf_cgroup_switch()
Armin Wolf <W_Armin(a)gmx.de>
hwmon: (dell-smm) Speed up setting of fan speed
Kees Cook <keescook(a)chromium.org>
seccomp: Invalidate seccomp mode to catch death failures
Johan Hovold <johan(a)kernel.org>
USB: serial: cp210x: add CPI Bulk Coin Recycler id
Johan Hovold <johan(a)kernel.org>
USB: serial: cp210x: add NCR Retail IO box id
Stephan Brunner <s.brunner(a)stephan-brunner.net>
USB: serial: ch341: add support for GW Instek USB2.0-Serial devices
Pawel Dembicki <paweldembicki(a)gmail.com>
USB: serial: option: add ZTE MF286D modem
Cameron Williams <cang1(a)live.co.uk>
USB: serial: ftdi_sio: add support for Brainboxes US-159/235/320
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
usb: gadget: rndis: check size of RNDIS_MSG_SET command
Szymon Heidrich <szymon.heidrich(a)gmail.com>
USB: gadget: validate interface OS descriptor requests
Udipto Goswami <quic_ugoswami(a)quicinc.com>
usb: dwc3: gadget: Prevent core from processing stale TRBs
Sean Anderson <sean.anderson(a)seco.com>
usb: ulpi: Call of_node_put correctly
Sean Anderson <sean.anderson(a)seco.com>
usb: ulpi: Move of_node_put to ulpi_dev_release
TATSUKAWA KOSUKE (立川 江介) <tatsu-ab1(a)nec.com>
n_tty: wake up poll(POLLRDNORM) on receiving data
Jakob Koschel <jakobkoschel(a)gmail.com>
vt_ioctl: add array_index_nospec to VT_ACTIVATE
Jakob Koschel <jakobkoschel(a)gmail.com>
vt_ioctl: fix array_index_nospec in vt_setactivate
Raju Rangoju <Raju.Rangoju(a)amd.com>
net: amd-xgbe: disable interrupts during pci removal
Jon Maloy <jmaloy(a)redhat.com>
tipc: rate limit warning for received illegal binding update
Eric Dumazet <edumazet(a)google.com>
veth: fix races around rq->rx_notify_masked
Antoine Tenart <atenart(a)kernel.org>
net: fix a memleak when uncloning an skb dst and its metadata
Antoine Tenart <atenart(a)kernel.org>
net: do not keep the dst cache when uncloning an skb dst and its metadata
Eric Dumazet <edumazet(a)google.com>
ipmr,ip6mr: acquire RTNL before calling ip[6]mr_free_table() on failure path
Mahesh Bandewar <maheshb(a)google.com>
bonding: pair enable_port with slave_arr_updates
Samuel Mendoza-Jonas <samjonas(a)amazon.com>
ixgbevf: Require large buffers for build_skb on 82599VF
Udipto Goswami <quic_ugoswami(a)quicinc.com>
usb: f_fs: Fix use-after-free for epfile
Fabio Estevam <festevam(a)gmail.com>
ARM: dts: imx6qdl-udoo: Properly describe the SD card detect
Uwe Kleine-König <u.kleine-koenig(a)pengutronix.de>
staging: fbtft: Fix error path in fbtft_driver_module_init()
Martin Blumenstingl <martin.blumenstingl(a)googlemail.com>
ARM: dts: meson: Fix the UART compatible strings
Zechuan Chen <chenzechuan1(a)huawei.com>
perf probe: Fix ppc64 'perf probe add events failed' case
Nikolay Aleksandrov <nikolay(a)cumulusnetworks.com>
net: bridge: fix stale eth hdr pointer in br_dev_xmit
Fabio Estevam <festevam(a)gmail.com>
ARM: dts: imx23-evk: Remove MX23_PAD_SSP1_DETECT from hog group
Daniel Borkmann <daniel(a)iogearbox.net>
bpf: Add kconfig knob for disabling unpriv bpf by default
Jisheng Zhang <jszhang(a)kernel.org>
net: stmmac: dwmac-sun8i: use return val of readl_poll_timeout()
Amelie Delaunay <amelie.delaunay(a)foss.st.com>
usb: dwc2: gadget: don't try to disable ep0 in dwc2_hsotg_suspend
ZouMingzhe <mingzhe.zou(a)easystack.cn>
scsi: target: iscsi: Make sure the np under each tpg is unique
Victor Nogueira <victor(a)mojatatu.com>
net: sched: Clarify error message when qdisc kind is unknown
Olga Kornievskaia <kolga(a)netapp.com>
NFSv4 expose nfs_parse_server_name function
Olga Kornievskaia <kolga(a)netapp.com>
NFSv4 remove zero number of fs_locations entries error check
Trond Myklebust <trond.myklebust(a)hammerspace.com>
NFSv4.1: Fix uninitialised variable in devicenotify
Xiaoke Wang <xkernel.wang(a)foxmail.com>
nfs: nfs4clinet: check the return value of kstrdup()
Olga Kornievskaia <kolga(a)netapp.com>
NFSv4 only print the label when its queried
Chuck Lever <chuck.lever(a)oracle.com>
NFSD: Fix offset type in I/O trace points
Chuck Lever <chuck.lever(a)oracle.com>
NFSD: Clamp WRITE offsets
Trond Myklebust <trond.myklebust(a)hammerspace.com>
NFS: Fix initialisation of nfs_client cl_flags field
Pavel Parkhomenko <Pavel.Parkhomenko(a)baikalelectronics.ru>
net: phy: marvell: Fix MDI-x polarity setting in 88e1118-compatible PHYs
Jiasheng Jiang <jiasheng(a)iscas.ac.cn>
mmc: sdhci-of-esdhc: Check for error num after setting mask
Roberto Sassu <roberto.sassu(a)huawei.com>
ima: Allow template selection with ima_template[_fmt]= after ima_hash=
Stefan Berger <stefanb(a)linux.ibm.com>
ima: Remove ima_policy file before directory
Xiaoke Wang <xkernel.wang(a)foxmail.com>
integrity: check the return value of audit_log_start()
-------------
Diffstat:
Documentation/sysctl/kernel.txt | 21 +++++++++
Makefile | 4 +-
arch/arm/boot/dts/imx23-evk.dts | 1 -
arch/arm/boot/dts/imx6qdl-udoo.dtsi | 5 +-
arch/arm/boot/dts/meson.dtsi | 8 ++--
drivers/hwmon/dell-smm-hwmon.c | 12 +++--
drivers/mmc/host/sdhci-of-esdhc.c | 8 +++-
drivers/net/bonding/bond_3ad.c | 3 +-
drivers/net/ethernet/amd/xgbe/xgbe-pci.c | 3 ++
drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 13 +++---
drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c | 2 +-
drivers/net/phy/marvell.c | 7 ++-
drivers/net/veth.c | 13 ++++--
drivers/staging/fbtft/fbtft.h | 5 +-
drivers/target/iscsi/iscsi_target_tpg.c | 3 ++
drivers/tty/n_tty.c | 4 +-
drivers/tty/vt/vt_ioctl.c | 5 +-
drivers/usb/common/ulpi.c | 10 ++--
drivers/usb/dwc2/gadget.c | 2 +-
drivers/usb/dwc3/gadget.c | 13 ++++++
drivers/usb/gadget/composite.c | 3 ++
drivers/usb/gadget/function/f_fs.c | 56 +++++++++++++++++------
drivers/usb/gadget/function/rndis.c | 9 ++--
drivers/usb/serial/ch341.c | 1 +
drivers/usb/serial/cp210x.c | 2 +
drivers/usb/serial/ftdi_sio.c | 3 ++
drivers/usb/serial/ftdi_sio_ids.h | 3 ++
drivers/usb/serial/option.c | 2 +
fs/nfs/callback.h | 2 +-
fs/nfs/callback_proc.c | 2 +-
fs/nfs/callback_xdr.c | 18 ++++----
fs/nfs/client.c | 2 +-
fs/nfs/nfs4_fs.h | 3 +-
fs/nfs/nfs4client.c | 5 +-
fs/nfs/nfs4namespace.c | 4 +-
fs/nfs/nfs4state.c | 3 ++
fs/nfs/nfs4xdr.c | 9 ++--
fs/nfsd/nfs3proc.c | 5 ++
fs/nfsd/nfs4proc.c | 5 +-
fs/nfsd/trace.h | 14 +++---
include/net/dst_metadata.h | 14 +++++-
init/Kconfig | 10 ++++
kernel/bpf/syscall.c | 3 +-
kernel/events/core.c | 4 +-
kernel/seccomp.c | 10 ++++
kernel/sysctl.c | 29 ++++++++++--
net/bridge/br_device.c | 6 +--
net/ipv4/ipmr.c | 2 +
net/ipv6/ip6mr.c | 2 +
net/sched/sch_api.c | 2 +-
net/tipc/name_distr.c | 2 +-
security/integrity/ima/ima_fs.c | 2 +-
security/integrity/ima/ima_template.c | 10 ++--
security/integrity/integrity_audit.c | 2 +
tools/perf/util/probe-event.c | 3 ++
55 files changed, 289 insertions(+), 105 deletions(-)
Most eDP panel functions only work correctly when the panel is not in
self-refresh. In particular, analogix_dp_bridge_disable() tends to hit
AUX channel errors if the panel is in self-refresh.
Given the above, it appears that so far, this driver assumes that we are
never in self-refresh when it comes time to fully disable the bridge.
Prior to commit 846c7dfc1193 ("drm/atomic: Try to preserve the crtc
enabled state in drm_atomic_remove_fb, v2."), this tended to be true,
because we would automatically disable the pipe when framebuffers were
removed, and so we'd typically disable the bridge shortly after the last
display activity.
However, that is not guaranteed: an idle (self-refresh) display pipe may
be disabled, e.g., when switching CRTCs. We need to exit PSR first.
Stable notes: this is definitely a bugfix, and the bug has likely
existed in some form for quite a while. It may predate the "PSR helpers"
refactor, but the code looked very different before that, and it's
probably not worth rewriting the fix.
Cc: <stable(a)vger.kernel.org>
Fixes: 6c836d965bad ("drm/rockchip: Use the helpers for PSR")
Signed-off-by: Brian Norris <briannorris(a)chromium.org>
---
.../drm/bridge/analogix/analogix_dp_core.c | 42 +++++++++++++++++--
1 file changed, 38 insertions(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/bridge/analogix/analogix_dp_core.c b/drivers/gpu/drm/bridge/analogix/analogix_dp_core.c
index b7d2e4449cfa..6ee0f62a7161 100644
--- a/drivers/gpu/drm/bridge/analogix/analogix_dp_core.c
+++ b/drivers/gpu/drm/bridge/analogix/analogix_dp_core.c
@@ -1268,6 +1268,25 @@ static int analogix_dp_bridge_attach(struct drm_bridge *bridge,
 	return 0;
 }
 
+static
+struct drm_crtc *analogix_dp_get_old_crtc(struct analogix_dp_device *dp,
+					  struct drm_atomic_state *state)
+{
+	struct drm_encoder *encoder = dp->encoder;
+	struct drm_connector *connector;
+	struct drm_connector_state *conn_state;
+
+	connector = drm_atomic_get_old_connector_for_encoder(state, encoder);
+	if (!connector)
+		return NULL;
+
+	conn_state = drm_atomic_get_old_connector_state(state, connector);
+	if (!conn_state)
+		return NULL;
+
+	return conn_state->crtc;
+}
+
 static
 struct drm_crtc *analogix_dp_get_new_crtc(struct analogix_dp_device *dp,
 					  struct drm_atomic_state *state)
@@ -1448,14 +1467,16 @@ analogix_dp_bridge_atomic_disable(struct drm_bridge *bridge,
 {
 	struct drm_atomic_state *old_state = old_bridge_state->base.state;
 	struct analogix_dp_device *dp = bridge->driver_private;
-	struct drm_crtc *crtc;
+	struct drm_crtc *old_crtc, *new_crtc;
+	struct drm_crtc_state *old_crtc_state = NULL;
 	struct drm_crtc_state *new_crtc_state = NULL;
+	int ret;
 
-	crtc = analogix_dp_get_new_crtc(dp, old_state);
-	if (!crtc)
+	new_crtc = analogix_dp_get_new_crtc(dp, old_state);
+	if (!new_crtc)
 		goto out;
 
-	new_crtc_state = drm_atomic_get_new_crtc_state(old_state, crtc);
+	new_crtc_state = drm_atomic_get_new_crtc_state(old_state, new_crtc);
 	if (!new_crtc_state)
 		goto out;
 
@@ -1464,6 +1485,19 @@ analogix_dp_bridge_atomic_disable(struct drm_bridge *bridge,
 		return;
 
 out:
+	old_crtc = analogix_dp_get_old_crtc(dp, old_state);
+	if (old_crtc) {
+		old_crtc_state = drm_atomic_get_old_crtc_state(old_state,
+							       old_crtc);
+
+		/* When moving from PSR to fully disabled, exit PSR first. */
+		if (old_crtc_state && old_crtc_state->self_refresh_active) {
+			ret = analogix_dp_disable_psr(dp);
+			if (ret)
+				DRM_ERROR("Failed to disable psr (%d)\n", ret);
+		}
+	}
+
 	analogix_dp_bridge_disable(bridge);
 }
 
--
2.35.1.265.g69c8d7142f-goog