[PATCH AUTOSEL 5.16 01/28] selftests/bpf: Add test for bpf_timer overwriting crash


Sasha Levin

1 Mar 2022

8:13 p.m.

From: Kumar Kartikeya Dwivedi memxor@gmail.com

[ Upstream commit a7e75016a0753c24d6c995bc02501ae35368e333 ]

Add a test that validates that timer value is not overwritten when doing a copy_map_value call in the kernel. Without the prior fix, this test triggers a crash.
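The property being tested can be sketched in plain userspace C: an update copies the new value over the stored one but must leave the kernel-owned timer slot alone. This is only an illustration under stated assumptions; `struct demo_elem` and `copy_value_skip()` are hypothetical names and are not the kernel's copy_map_value implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a map value: a kernel-owned slot followed by
 * ordinary user data. The layout is an assumption made for illustration. */
struct demo_elem {
	void *timer_slot;	/* models the pointer the kernel keeps for the timer */
	int payload;		/* ordinary user-visible data */
};

/* Copy src into dst, but leave dst's bytes in [off, off + len) untouched. */
static void copy_value_skip(void *dst, const void *src, size_t size,
			    size_t off, size_t len)
{
	/* Bytes before the protected region. */
	memcpy(dst, src, off);
	/* Bytes after the protected region. */
	memcpy((char *)dst + off + len, (const char *)src + off + len,
	       size - off - len);
}
```

With the timer slot skipped, a value carrying a bogus pointer such as 0xdeadcaf3 updates the payload but cannot replace the stored slot.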

Signed-off-by: Kumar Kartikeya Dwivedi memxor@gmail.com
Signed-off-by: Alexei Starovoitov ast@kernel.org
Link: https://lore.kernel.org/bpf/20220209070324.1093182-3-memxor@gmail.com
Signed-off-by: Sasha Levin sashal@kernel.org
---
 .../selftests/bpf/prog_tests/timer_crash.c    | 32 +++++++++++
 .../testing/selftests/bpf/progs/timer_crash.c | 54 +++++++++++++++++++
 2 files changed, 86 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/timer_crash.c
 create mode 100644 tools/testing/selftests/bpf/progs/timer_crash.c

diff --git a/tools/testing/selftests/bpf/prog_tests/timer_crash.c b/tools/testing/selftests/bpf/prog_tests/timer_crash.c
new file mode 100644
index 0000000000000..f74b82305da8c
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/timer_crash.c
@@ -0,0 +1,32 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <test_progs.h>
+#include "timer_crash.skel.h"
+
+enum {
+	MODE_ARRAY,
+	MODE_HASH,
+};
+
+static void test_timer_crash_mode(int mode)
+{
+	struct timer_crash *skel;
+
+	skel = timer_crash__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "timer_crash__open_and_load"))
+		return;
+	skel->bss->pid = getpid();
+	skel->bss->crash_map = mode;
+	if (!ASSERT_OK(timer_crash__attach(skel), "timer_crash__attach"))
+		goto end;
+	usleep(1);
+end:
+	timer_crash__destroy(skel);
+}
+
+void test_timer_crash(void)
+{
+	if (test__start_subtest("array"))
+		test_timer_crash_mode(MODE_ARRAY);
+	if (test__start_subtest("hash"))
+		test_timer_crash_mode(MODE_HASH);
+}
diff --git a/tools/testing/selftests/bpf/progs/timer_crash.c b/tools/testing/selftests/bpf/progs/timer_crash.c
new file mode 100644
index 0000000000000..f8f7944e70dae
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/timer_crash.c
@@ -0,0 +1,54 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <vmlinux.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_helpers.h>
+
+struct map_elem {
+	struct bpf_timer timer;
+	struct bpf_spin_lock lock;
+};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(max_entries, 1);
+	__type(key, int);
+	__type(value, struct map_elem);
+} amap SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(max_entries, 1);
+	__type(key, int);
+	__type(value, struct map_elem);
+} hmap SEC(".maps");
+
+int pid = 0;
+int crash_map = 0; /* 0 for amap, 1 for hmap */
+
+SEC("fentry/do_nanosleep")
+int sys_enter(void *ctx)
+{
+	struct map_elem *e, value = {};
+	void *map = crash_map ? (void *)&hmap : (void *)&amap;
+
+	if (bpf_get_current_task_btf()->tgid != pid)
+		return 0;
+
+	*(void **)&value = (void *)0xdeadcaf3;
+
+	bpf_map_update_elem(map, &(int){0}, &value, 0);
+	/* For array map, doing bpf_map_update_elem will do a
+	 * check_and_free_timer_in_array, which will trigger the crash if timer
+	 * pointer was overwritten, for hmap we need to use bpf_timer_cancel.
+	 */
+	if (crash_map == 1) {
+		e = bpf_map_lookup_elem(map, &(int){0});
+		if (!e)
+			return 0;
+		bpf_timer_cancel(&e->timer);
+	}
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";

-- 2.34.1


Sasha Levin

1 Mar

8:13 p.m.

New subject: [PATCH AUTOSEL 5.16 02/28] swiotlb: fix info leak with DMA_FROM_DEVICE

From: Halil Pasic pasic@linux.ibm.com

[ Upstream commit ddbd89deb7d32b1fbb879f48d68fda1a8ac58e8e ]

The problem I'm addressing was discovered by the LTP test covering cve-2018-1000204.

A short description of what happens follows:
1) The test case issues a command code 00 (TEST UNIT READY) via the SG_IO interface with: dxfer_len == 524288, dxfer_direction == SG_DXFER_FROM_DEV and a corresponding dxferp. The peculiar thing about this is that TUR is not reading from the device.
2) In sg_start_req() the invocation of blk_rq_map_user() effectively bounces the user-space buffer, as if the device were to transfer into it. Since commit a45b599ad808 ("scsi: sg: allocate with __GFP_ZERO in sg_build_indirect()") we make sure this first bounce buffer is allocated with __GFP_ZERO.
3) For the rest of the story we keep ignoring that we have a TUR, so the device won't touch the buffer we prepare as if we had a DMA_FROM_DEVICE type of situation. My setup uses a virtio-scsi device and the buffer allocated by SG is mapped by the function virtqueue_add_split(), which uses DMA_FROM_DEVICE for the "in" sgs (here scatter-gather and not SCSI generic). This mapping involves bouncing via the swiotlb (we need swiotlb to do virtio in protected guests such as s390 Secure Execution or AMD SEV).
4) When the SCSI TUR is done, we first copy back the content of the second (that is swiotlb) bounce buffer (which most likely contains some previous IO data) to the first bounce buffer, which contains all zeros. Then we copy back the content of the first bounce buffer to the user-space buffer.
5) The test case detects that the buffer, which it zero-initialized, ain't all zeros and fails.

One can argue that this is a swiotlb problem, because without swiotlb we leak all zeros, and the swiotlb should be transparent in the sense that it does not affect the outcome (if all other participants are well behaved).

Copying the content of the original buffer into the swiotlb buffer is the only way I can think of to make swiotlb transparent in such scenarios. So let's do just that if in doubt, but allow the driver to tell us that the whole mapped buffer is going to be overwritten, in which case we can preserve the old behavior and avoid the performance impact of the extra bounce.
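The decision described above, and implemented by the swiotlb.c hunk of this patch, can be modelled as a small predicate. This is a userspace sketch only: the `DEMO_` names and the enum are hypothetical stand-ins, and the SKIP_CPU_SYNC bit value is an arbitrary illustrative choice rather than the kernel's definition.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins; the real definitions live in kernel headers. */
enum demo_dir { DEMO_BIDIRECTIONAL, DEMO_TO_DEVICE, DEMO_FROM_DEVICE };
#define DEMO_ATTR_SKIP_CPU_SYNC	(1UL << 5)	/* assumed bit, for the model only */
#define DEMO_ATTR_OVERWRITE	(1UL << 10)	/* matches the value added by the patch */

/* Should the original buffer be copied into the bounce buffer at map time? */
static bool demo_should_bounce_on_map(enum demo_dir dir, unsigned long attrs)
{
	if (attrs & DEMO_ATTR_SKIP_CPU_SYNC)
		return false;
	/* Copy unless the driver promised to overwrite the whole mapping;
	 * transfers towards the device always need the copy. */
	return !(attrs & DEMO_ATTR_OVERWRITE) ||
	       dir == DEMO_TO_DEVICE || dir == DEMO_BIDIRECTIONAL;
}
```

In this model a plain DMA_FROM_DEVICE mapping is now bounced, so stale bounce-buffer contents are never copied back, while a mapping carrying the overwrite hint keeps the old behavior.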

Signed-off-by: Halil Pasic pasic@linux.ibm.com
Signed-off-by: Christoph Hellwig hch@lst.de
Signed-off-by: Sasha Levin sashal@kernel.org
---
 Documentation/core-api/dma-attributes.rst | 8 ++++++++
 include/linux/dma-mapping.h               | 8 ++++++++
 kernel/dma/swiotlb.c                      | 3 ++-
 3 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/Documentation/core-api/dma-attributes.rst b/Documentation/core-api/dma-attributes.rst
index 1887d92e8e926..17706dc91ec9f 100644
--- a/Documentation/core-api/dma-attributes.rst
+++ b/Documentation/core-api/dma-attributes.rst
@@ -130,3 +130,11 @@ accesses to DMA buffers in both privileged "supervisor" and unprivileged
 subsystem that the buffer is fully accessible at the elevated privilege
 level (and ideally inaccessible or at least read-only at the
 lesser-privileged levels).
+
+DMA_ATTR_OVERWRITE
+------------------
+
+This is a hint to the DMA-mapping subsystem that the device is expected to
+overwrite the entire mapped size, thus the caller does not require any of the
+previous buffer contents to be preserved. This allows bounce-buffering
+implementations to optimise DMA_FROM_DEVICE transfers.
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index dca2b1355bb13..6150d11a607e1 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -61,6 +61,14 @@
  */
 #define DMA_ATTR_PRIVILEGED		(1UL << 9)
 
+/*
+ * This is a hint to the DMA-mapping subsystem that the device is expected
+ * to overwrite the entire mapped size, thus the caller does not require any
+ * of the previous buffer contents to be preserved. This allows
+ * bounce-buffering implementations to optimise DMA_FROM_DEVICE transfers.
+ */
+#define DMA_ATTR_OVERWRITE		(1UL << 10)
+
 /*
  * A dma_addr_t can hold any valid DMA or bus address for the platform. It can
  * be given to a device to use as a DMA source or target. It is specific to a
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 8e840fbbed7c7..d958b1201092c 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -582,7 +582,8 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
 		mem->slots[index + i].orig_addr = slot_addr(orig_addr, i);
 	tlb_addr = slot_addr(mem->start, index) + offset;
 	if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC) &&
-	    (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL))
+	    (!(attrs & DMA_ATTR_OVERWRITE) || dir == DMA_TO_DEVICE ||
+	    dir == DMA_BIDIRECTIONAL))
 		swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE);
 	return tlb_addr;
 }

-- 2.34.1

Sasha Levin

8:13 p.m.

New subject: [PATCH AUTOSEL 5.16 03/28] usb: dwc3: pci: add support for the Intel Raptor Lake-S

From: Heikki Krogerus heikki.krogerus@linux.intel.com

[ Upstream commit 038438a25c45d5ac996e95a22fa9e76ff3d1f8c7 ]

This patch adds the necessary PCI ID for Intel Raptor Lake-S devices.

Signed-off-by: Heikki Krogerus heikki.krogerus@linux.intel.com
Link: https://lore.kernel.org/r/20220214141948.18637-1-heikki.krogerus@linux.intel...
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
Signed-off-by: Sasha Levin sashal@kernel.org
---
 drivers/usb/dwc3/dwc3-pci.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/usb/dwc3/dwc3-pci.c b/drivers/usb/dwc3/dwc3-pci.c
index 7ff8fc8f79a9b..4e69a9d829f23 100644
--- a/drivers/usb/dwc3/dwc3-pci.c
+++ b/drivers/usb/dwc3/dwc3-pci.c
@@ -43,6 +43,7 @@
 #define PCI_DEVICE_ID_INTEL_ADLP		0x51ee
 #define PCI_DEVICE_ID_INTEL_ADLM		0x54ee
 #define PCI_DEVICE_ID_INTEL_ADLS		0x7ae1
+#define PCI_DEVICE_ID_INTEL_RPLS		0x7a61
 #define PCI_DEVICE_ID_INTEL_TGL			0x9a15
 #define PCI_DEVICE_ID_AMD_MR			0x163a
 
@@ -409,6 +410,9 @@ static const struct pci_device_id dwc3_pci_id_table[] = {
 	{ PCI_VDEVICE(INTEL, PCI_DEVICE_ID_INTEL_ADLS),
 	  (kernel_ulong_t) &dwc3_pci_intel_swnode, },
 
+	{ PCI_VDEVICE(INTEL, PCI_DEVICE_ID_INTEL_RPLS),
+	  (kernel_ulong_t) &dwc3_pci_intel_swnode, },
+
 	{ PCI_VDEVICE(INTEL, PCI_DEVICE_ID_INTEL_TGL),
 	  (kernel_ulong_t) &dwc3_pci_intel_swnode, },

-- 2.34.1

Sasha Levin

8:13 p.m.

New subject: [PATCH AUTOSEL 5.16 04/28] pinctrl: tigerlake: Revert "Add Alder Lake-M ACPI ID"

From: Andy Shevchenko andriy.shevchenko@linux.intel.com

[ Upstream commit 6f66db29e2415cbe8759c48584f9cae19b3c2651 ]

It appears that a last-minute change moved the ACPI ID of Alder Lake-M to INTC1055, which is already in the driver.

This ID on the other hand will be used elsewhere.

This reverts commit 258435a1c8187f559549e515d2f77fa0b57bcd27.

Signed-off-by: Andy Shevchenko andriy.shevchenko@linux.intel.com
Signed-off-by: Sasha Levin sashal@kernel.org
---
 drivers/pinctrl/intel/pinctrl-tigerlake.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/pinctrl/intel/pinctrl-tigerlake.c b/drivers/pinctrl/intel/pinctrl-tigerlake.c
index 0bcd19597e4ad..3ddaeffc04150 100644
--- a/drivers/pinctrl/intel/pinctrl-tigerlake.c
+++ b/drivers/pinctrl/intel/pinctrl-tigerlake.c
@@ -749,7 +749,6 @@ static const struct acpi_device_id tgl_pinctrl_acpi_match[] = {
 	{ "INT34C5", (kernel_ulong_t)&tgllp_soc_data },
 	{ "INT34C6", (kernel_ulong_t)&tglh_soc_data },
 	{ "INTC1055", (kernel_ulong_t)&tgllp_soc_data },
-	{ "INTC1057", (kernel_ulong_t)&tgllp_soc_data },
 	{ }
 };
 MODULE_DEVICE_TABLE(acpi, tgl_pinctrl_acpi_match);

-- 2.34.1

Sasha Levin

8:13 p.m.

New subject: [PATCH AUTOSEL 5.16 05/28] KVM: Fix lockdep false negative during host resume

From: Wanpeng Li wanpengli@tencent.com

[ Upstream commit 4cb9a998b1ce25fad74a82f5a5c45a4ef40de337 ]

I saw the splat below after the host was suspended and resumed.

WARNING: CPU: 0 PID: 2943 at kvm/arch/x86/kvm/../../../virt/kvm/kvm_main.c:5531 kvm_resume+0x2c/0x30 [kvm]
CPU: 0 PID: 2943 Comm: step_after_susp Tainted: G        W IOE     5.17.0-rc3+ #4
RIP: 0010:kvm_resume+0x2c/0x30 [kvm]
Call Trace:
 <TASK>
 syscore_resume+0x90/0x340
 suspend_devices_and_enter+0xaee/0xe90
 pm_suspend.cold+0x36b/0x3c2
 state_store+0x82/0xf0
 kernfs_fop_write_iter+0x1b6/0x260
 new_sync_write+0x258/0x370
 vfs_write+0x33f/0x510
 ksys_write+0xc9/0x160
 do_syscall_64+0x3b/0xc0
 entry_SYSCALL_64_after_hwframe+0x44/0xae

lockdep_is_held() can return -1 when lockdep is disabled which triggers this warning. Let's use lockdep_assert_not_held() which can detect incorrect calls while holding a lock and it also avoids false negatives when lockdep is disabled.
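The difference between the two checks can be modelled with a tri-state value. This is a sketch under assumptions: the commit message states only that lockdep_is_held() can return -1; the `DEMO_` names below are hypothetical, and the exact implementation of lockdep_assert_not_held() is not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical tri-state mirroring "held", "not held" and "lockdep disabled". */
enum { DEMO_LOCK_UNKNOWN = -1, DEMO_LOCK_NOT_HELD = 0, DEMO_LOCK_HELD = 1 };

/* Old check, WARN_ON(state): fires on any non-zero value, including -1. */
static bool demo_old_check_warns(int state)
{
	return state != 0;
}

/* New-style check: fires only when the lock is positively known to be held. */
static bool demo_new_check_warns(int state)
{
	return state == DEMO_LOCK_HELD;
}
```

Under this model the old check warns for the "unknown" state even though no lock is held, which matches the spurious warning in the trace, while the stricter comparison stays quiet.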

Signed-off-by: Wanpeng Li wanpengli@tencent.com
Message-Id: 1644920142-81249-1-git-send-email-wanpengli@tencent.com
Signed-off-by: Paolo Bonzini pbonzini@redhat.com
Signed-off-by: Sasha Levin sashal@kernel.org
---
 virt/kvm/kvm_main.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 71ddc7a8bc302..6ae9e04d0585e 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -5347,9 +5347,7 @@ static int kvm_suspend(void)
 static void kvm_resume(void)
 {
 	if (kvm_usage_count) {
-#ifdef CONFIG_LOCKDEP
-		WARN_ON(lockdep_is_held(&kvm_count_lock));
-#endif
+		lockdep_assert_not_held(&kvm_count_lock);
 		hardware_enable_nolock(NULL);
 	}
 }

-- 2.34.1

Paolo Bonzini

8:19 p.m.

New subject: [PATCH AUTOSEL 5.16 05/28] KVM: Fix lockdep false negative during host resume

On 3/1/22 21:13, Sasha Levin wrote:

...

From: Wanpeng Li wanpengli@tencent.com

[ Upstream commit 4cb9a998b1ce25fad74a82f5a5c45a4ef40de337 ]

I saw the below splatting after the host suspended and resumed.
WARNING: CPU: 0 PID: 2943 at kvm/arch/x86/kvm/../../../virt/kvm/kvm_main.c:5531 kvm_resume+0x2c/0x30 [kvm]
CPU: 0 PID: 2943 Comm: step_after_susp Tainted: G        W IOE     5.17.0-rc3+ #4
RIP: 0010:kvm_resume+0x2c/0x30 [kvm]
Call Trace:
 <TASK>
 syscore_resume+0x90/0x340
 suspend_devices_and_enter+0xaee/0xe90
 pm_suspend.cold+0x36b/0x3c2
 state_store+0x82/0xf0
 kernfs_fop_write_iter+0x1b6/0x260
 new_sync_write+0x258/0x370
 vfs_write+0x33f/0x510
 ksys_write+0xc9/0x160
 do_syscall_64+0x3b/0xc0
 entry_SYSCALL_64_after_hwframe+0x44/0xae
lockdep_is_held() can return -1 when lockdep is disabled which triggers this warning. Let's use lockdep_assert_not_held() which can detect incorrect calls while holding a lock and it also avoids false negatives when lockdep is disabled.

Signed-off-by: Wanpeng Li wanpengli@tencent.com
Message-Id: 1644920142-81249-1-git-send-email-wanpengli@tencent.com
Signed-off-by: Paolo Bonzini pbonzini@redhat.com
Signed-off-by: Sasha Levin sashal@kernel.org
---
 virt/kvm/kvm_main.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 71ddc7a8bc302..6ae9e04d0585e 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -5347,9 +5347,7 @@ static int kvm_suspend(void)
 static void kvm_resume(void)
 {
 	if (kvm_usage_count) {
-#ifdef CONFIG_LOCKDEP
-		WARN_ON(lockdep_is_held(&kvm_count_lock));
-#endif
+		lockdep_assert_not_held(&kvm_count_lock);
 		hardware_enable_nolock(NULL);
 	}
 }

Acked-by: Paolo Bonzini pbonzini@redhat.com

Sasha Levin

8:13 p.m.

New subject: [PATCH AUTOSEL 5.16 06/28] kvm: x86: Disable KVM_HC_CLOCK_PAIRING if tsc is in always catchup mode

From: Anton Romanov romanton@google.com

[ Upstream commit 3a55f729240a686aa8af00af436306c0cd532522 ]

If vcpu has tsc_always_catchup set each request updates pvclock data. KVM_HC_CLOCK_PAIRING consumers such as ptp_kvm_x86 rely on tsc read on host's side and do hypercall inside pvclock_read_retry loop leading to infinite loop in such situation.
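Why the guest-side loop never terminates can be seen in a small userspace model of a version-checked retry loop. This is a sketch with hypothetical names (`struct demo_clock`, `demo_hypercall()`, `demo_read_with_retry()`), not the real pvclock code, and it caps the number of attempts so the simulation itself stops.

```c
#include <assert.h>
#include <stdbool.h>

struct demo_clock {
	unsigned int version;	/* bumped whenever the host rewrites the clock data */
	bool always_catchup;	/* models vcpu->arch.tsc_always_catchup */
};

/* Models the hypercall: in catchup mode every request rewrites the clock. */
static void demo_hypercall(struct demo_clock *c)
{
	if (c->always_catchup)
		c->version += 2;
}

/* Returns the number of attempts needed for a consistent read,
 * or -1 if the data was still changing after max_tries attempts. */
static int demo_read_with_retry(struct demo_clock *c, int max_tries)
{
	for (int tries = 1; tries <= max_tries; tries++) {
		unsigned int start = c->version;

		demo_hypercall(c);	/* issued inside the retry loop */
		if (c->version == start)
			return tries;	/* snapshot was stable */
	}
	return -1;
}
```

In the model a normal clock succeeds on the first attempt, while the catchup case invalidates its own snapshot on every pass, so refusing the hypercall up front is the only way out for the caller.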

v3:
    Removed warn
    Changed return code to KVM_EFAULT
v2:
    Added warn

Signed-off-by: Anton Romanov romanton@google.com
Message-Id: 20220216182653.506850-1-romanton@google.com
Signed-off-by: Paolo Bonzini pbonzini@redhat.com
Signed-off-by: Sasha Levin sashal@kernel.org
---
 arch/x86/kvm/x86.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0714fa0e7ede0..18fc0367ef21a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8769,6 +8769,13 @@ static int kvm_pv_clock_pairing(struct kvm_vcpu *vcpu, gpa_t paddr,
 	if (clock_type != KVM_CLOCK_PAIRING_WALLCLOCK)
 		return -KVM_EOPNOTSUPP;
 
+	/*
+	 * When tsc is in permanent catchup mode guests won't be able to use
+	 * pvclock_read_retry loop to get consistent view of pvclock
+	 */
+	if (vcpu->arch.tsc_always_catchup)
+		return -KVM_EOPNOTSUPP;
+
 	if (!kvm_get_walltime_and_clockread(&ts, &cycle))
 		return -KVM_EOPNOTSUPP;

-- 2.34.1

Paolo Bonzini

8:19 p.m.

New subject: [PATCH AUTOSEL 5.16 06/28] kvm: x86: Disable KVM_HC_CLOCK_PAIRING if tsc is in always catchup mode

On 3/1/22 21:13, Sasha Levin wrote:

...

From: Anton Romanov romanton@google.com

[ Upstream commit 3a55f729240a686aa8af00af436306c0cd532522 ]

If vcpu has tsc_always_catchup set each request updates pvclock data. KVM_HC_CLOCK_PAIRING consumers such as ptp_kvm_x86 rely on tsc read on host's side and do hypercall inside pvclock_read_retry loop leading to infinite loop in such situation.

v3: Removed warn Changed return code to KVM_EFAULT v2: Added warn

Signed-off-by: Anton Romanov romanton@google.com
Message-Id: 20220216182653.506850-1-romanton@google.com
Signed-off-by: Paolo Bonzini pbonzini@redhat.com
Signed-off-by: Sasha Levin sashal@kernel.org
---
 arch/x86/kvm/x86.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0714fa0e7ede0..18fc0367ef21a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8769,6 +8769,13 @@ static int kvm_pv_clock_pairing(struct kvm_vcpu *vcpu, gpa_t paddr,
 	if (clock_type != KVM_CLOCK_PAIRING_WALLCLOCK)
 		return -KVM_EOPNOTSUPP;
 
+	/*
+	 * When tsc is in permanent catchup mode guests won't be able to use
+	 * pvclock_read_retry loop to get consistent view of pvclock
+	 */
+	if (vcpu->arch.tsc_always_catchup)
+		return -KVM_EOPNOTSUPP;
+
 	if (!kvm_get_walltime_and_clockread(&ts, &cycle))
 		return -KVM_EOPNOTSUPP;

Acked-by: Paolo Bonzini pbonzini@redhat.com

Sasha Levin

8:13 p.m.

New subject: [PATCH AUTOSEL 5.16 07/28] x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0

From: Leonardo Bras leobras@redhat.com

[ Upstream commit ad856280ddea3401e1f5060ef20e6de9f6122c76 ]

During host/guest switch (like in kvm_arch_vcpu_ioctl_run()), the kernel swaps the fpu between host/guest contexts, by using fpu_swap_kvm_fpstate().

When xsave feature is available, the fpu swap is done by:
- xsave(s) instruction, with guest's fpstate->xfeatures as mask, is used to store the current state of the fpu registers to a buffer.
- xrstor(s) instruction, with (fpu_kernel_cfg.max_features & XFEATURE_MASK_FPSTATE) as mask, is used to put the buffer into fpu regs.

For xsave(s) the mask is used to limit what parts of the fpu regs will be copied to the buffer. Likewise on xrstor(s), the mask is used to limit what parts of the fpu regs will be changed.

The mask for xsave(s), the guest's fpstate->xfeatures, is defined on kvm_arch_vcpu_create(), which (in summary) sets it to all features supported by the cpu which are enabled on kernel config.

This means that xsave(s) will save to guest buffer all the fpu regs contents the cpu has enabled when the guest is paused, even if they are not used.

This would not be an issue, if xrstor(s) would also do that.

xrstor(s)'s mask for host/guest swap is basically every valid feature contained in kernel config, except XFEATURE_MASK_PKRU. According to the kernel source, it is instead switched in switch_to() and flush_thread().

Then, the following happens when a host supporting PKRU starts a guest that does not support it:
1 - Host has XFEATURE_MASK_PKRU set. 1st switch to guest,
2 - xsave(s) fpu regs to host fpustate (buffer has XFEATURE_MASK_PKRU)
3 - xrstor(s) guest fpustate to fpu regs (fpu regs have XFEATURE_MASK_PKRU)
4 - guest runs, then switch back to host,
5 - xsave(s) fpu regs to guest fpstate (buffer now has XFEATURE_MASK_PKRU)
6 - xrstor(s) host fpstate to fpu regs.
7 - kvm_vcpu_ioctl_x86_get_xsave() copy guest fpstate to userspace (with XFEATURE_MASK_PKRU, which should not be supported by guest vcpu)

On 5, even though the guest does not support PKRU, it does have the flag set on guest fpstate, which is transferred to userspace via vcpu ioctl KVM_GET_XSAVE.

This becomes a problem when the user decides to migrate the above guest to another machine that does not support PKRU: the new host restores the guest's fpu regs to what they were before (xrstor(s)), but since the new host doesn't support PKRU, a general-protection exception occurs in xrstor(s) and that crashes the guest.

This can be solved by making the guest's fpstate->user_xfeatures hold a copy of guest_supported_xcr0. This way, on 7 the only flags copied to userspace will be the ones compatible to guest requirements, and thus there will be no issue during migration.
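The effect on what userspace sees can be sketched as a simple mask. The bit positions below follow the x86 XSAVE state-component numbering (x87 is bit 0, SSE bit 1, AVX/YMM bit 2, PKRU bit 9); the `DEMO_` names and `demo_user_visible()` are hypothetical and only model the idea, not the KVM code path.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_XFEATURE_FP	(1ULL << 0)
#define DEMO_XFEATURE_SSE	(1ULL << 1)
#define DEMO_XFEATURE_YMM	(1ULL << 2)
#define DEMO_XFEATURE_PKRU	(1ULL << 9)

/* Features exposed to userspace: whatever was saved in the guest buffer,
 * limited to what the guest's CPUID says it supports. */
static uint64_t demo_user_visible(uint64_t saved, uint64_t guest_supported)
{
	return saved & guest_supported;
}
```

For a guest that supports only FP, SSE and YMM, a buffer that also carries the PKRU bit is reported as 0x7, so nothing PKRU-related travels to the destination host.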

As a bonus, it will also fail if userspace tries to set fpu features (with the KVM_SET_XSAVE ioctl) that are not compatible to the guest configuration. Such features will never be returned by KVM_GET_XSAVE or KVM_GET_XSAVE2.
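The rejection described in this paragraph amounts to a subset check on the incoming feature bits. The sketch below is an assumption about the shape of such a check, with hypothetical names (`demo_validate_xfeatures()`, `DEMO_EINVAL`); it does not reproduce the kernel's actual validation code.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_EINVAL 22	/* illustrative error value for the model */

/* Accept the header only if every requested feature bit is allowed
 * by the guest's user_xfeatures mask. */
static int demo_validate_xfeatures(uint64_t requested, uint64_t user_xfeatures)
{
	if (requested & ~user_xfeatures)
		return -DEMO_EINVAL;	/* at least one bit the guest must not use */
	return 0;
}
```

With a guest mask of 0x3 (FP and SSE), a request for 0x7 is refused because the YMM bit is not permitted, while a request for 0x3 or any subset passes.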

Also, since kvm_vcpu_after_set_cpuid() now sets fpstate->user_xfeatures, there is no need to set it in kvm_check_cpuid(). So, change fpstate_realloc() so it does not touch fpstate->user_xfeatures if a non-NULL guest_fpu is passed, which is the case when kvm_check_cpuid() calls it.

Signed-off-by: Leonardo Bras leobras@redhat.com
Message-Id: 20220217053028.96432-2-leobras@redhat.com
Signed-off-by: Paolo Bonzini pbonzini@redhat.com
Signed-off-by: Sasha Levin sashal@kernel.org
---
 arch/x86/kernel/fpu/xstate.c | 5 ++++-
 arch/x86/kvm/cpuid.c         | 2 ++
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index d28829403ed08..6ac01f9828530 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1563,7 +1563,10 @@ static int fpstate_realloc(u64 xfeatures, unsigned int ksize,
 		fpregs_restore_userregs();
 
 	newfps->xfeatures = curfps->xfeatures | xfeatures;
-	newfps->user_xfeatures = curfps->user_xfeatures | xfeatures;
+
+	if (!guest_fpu)
+		newfps->user_xfeatures = curfps->user_xfeatures | xfeatures;
+
 	newfps->xfd = curfps->xfd & ~xfeatures;
 
 	curfps = fpu_install_fpstate(fpu, newfps);
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index bf18679757c70..875dce4aa2d28 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -276,6 +276,8 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
 	vcpu->arch.guest_supported_xcr0 =
 		cpuid_get_supported_xcr0(vcpu->arch.cpuid_entries, vcpu->arch.cpuid_nent);
 
+	vcpu->arch.guest_fpu.fpstate->user_xfeatures = vcpu->arch.guest_supported_xcr0;
+
 	kvm_update_pv_runtime(vcpu);
 
 	vcpu->arch.maxphyaddr = cpuid_query_maxphyaddr(vcpu);

-- 2.34.1

Paolo Bonzini

8:22 p.m.

New subject: [PATCH AUTOSEL 5.16 07/28] x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0

On 3/1/22 21:13, Sasha Levin wrote:

...

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index d28829403ed08..6ac01f9828530 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1563,7 +1563,10 @@ static int fpstate_realloc(u64 xfeatures, unsigned int ksize,
 		fpregs_restore_userregs();
 
 	newfps->xfeatures = curfps->xfeatures | xfeatures;
-	newfps->user_xfeatures = curfps->user_xfeatures | xfeatures;
+
+	if (!guest_fpu)
+		newfps->user_xfeatures = curfps->user_xfeatures | xfeatures;
+
 	newfps->xfd = curfps->xfd & ~xfeatures;
 
 	curfps = fpu_install_fpstate(fpu, newfps);
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index bf18679757c70..875dce4aa2d28 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -276,6 +276,8 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
 	vcpu->arch.guest_supported_xcr0 =
 		cpuid_get_supported_xcr0(vcpu->arch.cpuid_entries, vcpu->arch.cpuid_nent);
 
+	vcpu->arch.guest_fpu.fpstate->user_xfeatures = vcpu->arch.guest_supported_xcr0;
+
 	kvm_update_pv_runtime(vcpu);
 
 	vcpu->arch.maxphyaddr = cpuid_query_maxphyaddr(vcpu);

Leonardo, was this also buggy in 5.16? (I should have asked for a Fixes tag...).

Paolo

Peter Xu

3 Jun

6:40 p.m.

New subject: [PATCH AUTOSEL 5.16 07/28] x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0

On Tue, Mar 01, 2022 at 09:22:10PM +0100, Paolo Bonzini wrote:

...

On 3/1/22 21:13, Sasha Levin wrote:

...
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index d28829403ed08..6ac01f9828530 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1563,7 +1563,10 @@ static int fpstate_realloc(u64 xfeatures, unsigned int ksize,
 		fpregs_restore_userregs();
 
 	newfps->xfeatures = curfps->xfeatures | xfeatures;
-	newfps->user_xfeatures = curfps->user_xfeatures | xfeatures;
+
+	if (!guest_fpu)
+		newfps->user_xfeatures = curfps->user_xfeatures | xfeatures;
+
 	newfps->xfd = curfps->xfd & ~xfeatures;
 
 	curfps = fpu_install_fpstate(fpu, newfps);
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index bf18679757c70..875dce4aa2d28 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -276,6 +276,8 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
 	vcpu->arch.guest_supported_xcr0 =
 		cpuid_get_supported_xcr0(vcpu->arch.cpuid_entries, vcpu->arch.cpuid_nent);
 
+	vcpu->arch.guest_fpu.fpstate->user_xfeatures = vcpu->arch.guest_supported_xcr0;
+
 	kvm_update_pv_runtime(vcpu);
 
 	vcpu->arch.maxphyaddr = cpuid_query_maxphyaddr(vcpu);
Leonardo, was this also buggy in 5.16? (I should have asked for a Fixes tag...).

I just stumbled over this patch on some migration tests in the past few days..

In short, I was migrating a VM from a 5.15 host to a 5.18 host and the guest triggered a double fault immediately after the switch-over (I think that's when it tried to do vmenter and a VECTOR_DF was injected), with either precopy or postcopy.

After I upgraded the 5.15 src host to 5.18, the problem went away. I did a bisect on dest and surprisingly it points to this commit.

Side note: I'm using two hosts that have the same processor model, so no case of missing features on either side - they just match.

I'm not really sure whether this is a bug or by design - do we require this patch to be applied to all stable branches to make the guest not crash after migration, or it is unexpected?

FWICT, this patch modifies user_xfeatures, which we did not do before. It sounds reasonable to me at first glance: if the guest didn't enable some of the fpu features, we don't need to migrate those fpu state chunks, since we migrate based on user_xfeatures. It also sounds good for solving the migration issue from a "has-pkru" host to a "no-pkru" host as described in the patch commit message.

However there seems to be something missing at least to me, on why it'll fail a migration from 5.15 (without this patch) to 5.18 (with this patch). In my test case, user_xfeatures will be 0x7 (FP|SSE|YMM) if without this patch, but 0x0 if with it.

I think what should be happening is that user_xfeatures is set to 0x7 on src (old kernel), so we should have migrated some more chunks to dest. I just don't quickly understand why that's a problem, because fundamentally when we restore the fpu status (fpu_swap_kvm_fpstate) we use the max feature bitmask anyway, and the dest hardware should support all of them. I don't quickly see how that could trigger a double fault, though.

I'll continue the dig probably next week, before that, any thoughts?

-- Peter Xu

Paolo Bonzini

6 Jun

4:18 p.m.

New subject: [PATCH AUTOSEL 5.16 07/28] x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0

On 6/3/22 20:40, Peter Xu wrote:

...

I'm not really sure whether this is a bug or by design - do we require this patch to be applied to all stable branches to make the guest not crash after migration, or it is unexpected?

Yes, we do, though the only reported bug was for PKRU.

...

However there seems to be something missing at least to me, on why it'll fail a migration from 5.15 (without this patch) to 5.18 (with this patch). In my test case, user_xfeatures will be 0x7 (FP|SSE|YMM) if without this patch, but 0x0 if with it.

What CPU model are you using for the VM? For example, if the source lacks this patch but the destination has it, the source will transmit YMM registers, but the destination will fail to set them if they are not available for the selected CPU model.

See the commit message: "As a bonus, it will also fail if userspace tries to set fpu features (with the KVM_SET_XSAVE ioctl) that are not compatible to the guest configuration. Such features will never be returned by KVM_GET_XSAVE or KVM_GET_XSAVE2."

Paolo

Peter Xu

9:27 p.m.

New subject: [PATCH AUTOSEL 5.16 07/28] x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0

On Mon, Jun 06, 2022 at 06:18:12PM +0200, Paolo Bonzini wrote:

...

...
However there seems to be something missing at least to me, on why it'll fail a migration from 5.15 (without this patch) to 5.18 (with this patch). In my test case, user_xfeatures will be 0x7 (FP|SSE|YMM) if without this patch, but 0x0 if with it.

What CPU model are you using for the VM?

I didn't specify it, assuming it's qemu64 with no extra parameters.

I just tried two other options with: (1) -cpu host, and (2) -cpu Haswell (the choice of Haswell was really random..), with the same 5.15->5.18 migration scenario, both of them will not trigger the same guest kernel crash. Only qemu64 will.

Both hosts have Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz.

...

For example, if the source lacks this patch but the destination has it, the source will transmit YMM registers, but the destination will fail to set them if they are not available for the selected CPU model.

See the commit message: "As a bonus, it will also fail if userspace tries to set fpu features (with the KVM_SET_XSAVE ioctl) that are not compatible to the guest configuration. Such features will never be returned by KVM_GET_XSAVE or KVM_GET_XSAVE2."

IIUC you meant we should have failed KVM_SET_XSAVE when they're not aligned (probably by failing validate_user_xstate_header when checking against the user_xfeatures on dest host). But that's probably not my case, because here KVM_SET_XSAVE succeeded, it's just that the guest gets a double fault after the precopy migration completes (or for postcopy when the switchover is done).

Thanks,

-- Peter Xu

Paolo Bonzini

7 Jun

12:54 p.m.

New subject: [PATCH AUTOSEL 5.16 07/28] x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0

On 6/6/22 23:27, Peter Xu wrote:

...

On Mon, Jun 06, 2022 at 06:18:12PM +0200, Paolo Bonzini wrote:

...
...
However there seems to be something missing at least to me, on why it'll fail a migration from 5.15 (without this patch) to 5.18 (with this patch). In my test case, user_xfeatures will be 0x7 (FP|SSE|YMM) if without this patch, but 0x0 if with it.

What CPU model are you using for the VM?

I didn't specify it, assuming it's qemu64 with no extra parameters.

Ok, so indeed it lacks AVX and this patch can have an effect.

...

...
For example, if the source lacks this patch but the destination has it, the source will transmit YMM registers, but the destination will fail to set them if they are not available for the selected CPU model.

See the commit message: "As a bonus, it will also fail if userspace tries to set fpu features (with the KVM_SET_XSAVE ioctl) that are not compatible to the guest configuration. Such features will never be returned by KVM_GET_XSAVE or KVM_GET_XSAVE2."

IIUC you meant we should have failed KVM_SET_XSAVE when they're not aligned (probably by failing validate_user_xstate_header when checking against the user_xfeatures on dest host). But that's probably not my case, because here KVM_SET_XSAVE succeeded, it's just that the guest gets a double fault after the precopy migration completes (or for postcopy when the switchover is done).

Difficult to say what's happening without seeing at least the guest code around the double fault (above you said "fail a migration" and I thought that was a different scenario than the double fault), and possibly which was the first exception that contributed to the double fault.

Paolo

Sean Christopherson

3:04 p.m.

New subject: [PATCH AUTOSEL 5.16 07/28] x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0

On Tue, Jun 07, 2022, Paolo Bonzini wrote:

...

On 6/6/22 23:27, Peter Xu wrote:

...
On Mon, Jun 06, 2022 at 06:18:12PM +0200, Paolo Bonzini wrote:

...
...
However there seems to be something missing at least to me, on why it'll fail a migration from 5.15 (without this patch) to 5.18 (with this patch). In my test case, user_xfeatures will be 0x7 (FP|SSE|YMM) if without this patch, but 0x0 if with it.

What CPU model are you using for the VM?

I didn't specify it, assuming it's qemu64 with no extra parameters.

Ok, so indeed it lacks AVX and this patch can have an effect.

...
...
For example, if the source lacks this patch but the destination has it, the source will transmit YMM registers, but the destination will fail to set them if they are not available for the selected CPU model.

See the commit message: "As a bonus, it will also fail if userspace tries to set fpu features (with the KVM_SET_XSAVE ioctl) that are not compatible to the guest configuration. Such features will never be returned by KVM_GET_XSAVE or KVM_GET_XSAVE2."

IIUC you meant we should have failed KVM_SET_XSAVE when they're not aligned (probably by failing validate_user_xstate_header when checking against the user_xfeatures on dest host). But that's probably not my case, because here KVM_SET_XSAVE succeeded, it's just that the guest gets a double fault after the precopy migration completes (or for postcopy when the switchover is done).

Difficult to say what's happening without seeing at least the guest code around the double fault (above you said "fail a migration" and I thought that was a different scenario than the double fault), and possibly which was the first exception that contributed to the double fault.

Regardless of why the guest explodes in the way it does, is someone planning on bisecting this (if necessary?) and sending a backport to v5.15? There's another bug report that is more than likely hitting the same bug.

https://lore.kernel.org/all/48353e0d-e771-8a97-21d4-c65ff3bc4192@sentex.net

Peter Xu

6:17 p.m.

New subject: [PATCH AUTOSEL 5.16 07/28] x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0

On Tue, Jun 07, 2022 at 03:04:27PM +0000, Sean Christopherson wrote:

...

On Tue, Jun 07, 2022, Paolo Bonzini wrote:

...
On 6/6/22 23:27, Peter Xu wrote:

...
On Mon, Jun 06, 2022 at 06:18:12PM +0200, Paolo Bonzini wrote:

...
...
However there seems to be something missing at least to me, on why it'll fail a migration from 5.15 (without this patch) to 5.18 (with this patch). In my test case, user_xfeatures will be 0x7 (FP|SSE|YMM) if without this patch, but 0x0 if with it.

What CPU model are you using for the VM?

I didn't specify it, assuming it's qemu64 with no extra parameters.

Ok, so indeed it lacks AVX and this patch can have an effect.

...
...
For example, if the source lacks this patch but the destination has it, the source will transmit YMM registers, but the destination will fail to set them if they are not available for the selected CPU model.

See the commit message: "As a bonus, it will also fail if userspace tries to set fpu features (with the KVM_SET_XSAVE ioctl) that are not compatible to the guest configuration. Such features will never be returned by KVM_GET_XSAVE or KVM_GET_XSAVE2."

IIUC you meant we should have failed KVM_SET_XSAVE when they're not aligned (probably by failing validate_user_xstate_header when checking against the user_xfeatures on dest host). But that's probably not my case, because here KVM_SET_XSAVE succeeded, it's just that the guest gets a double fault after the precopy migration completes (or for postcopy when the switchover is done).

Difficult to say what's happening without seeing at least the guest code around the double fault (above you said "fail a migration" and I thought that was a different scenario than the double fault), and possibly which was the first exception that contributed to the double fault.

Regardless of why the guest explodes in the way it does, is someone planning on bisecting this (if necessary?) and sending a backport to v5.15? There's another bug report that is more than likely hitting the same bug.

What's the bisection you mentioned? I actually did a bisection and I also checked reverting Leo's change can also fix this issue. Or do you mean something else?

...

https://lore.kernel.org/all/48353e0d-e771-8a97-21d4-c65ff3bc4192@sentex.net

That is kvm64, and I agree it could be the same problem since both qemu64 and kvm64 models do not have any xsave feature bit declared in cpuid 0xd, so potentially we could be migrating some fpu states to it even with user_xfeatures==0 on dest host.

So today I continued the investigation, and I think what's really missing is that QEMU seems to ignore the user_xfeatures check for KVM_SET_XSAVE and continues even if it returns -EINVAL. IOW, I'm wondering whether we should fail properly and start checking the kvm_arch_put_registers() retcode. That would be a QEMU fix, though; it would at least stop causing random faults (e.g. double faults) in the guest, and we would fail the migration gracefully instead.

Sean: a side note is that I can also easily trigger one WARN_ON_ONCE() in your commit 98c25ead5eda5 in kvm_arch_vcpu_ioctl_run():

WARN_ON_ONCE(kvm_lapic_hv_timer_in_use(vcpu));

It'll be great if you'd like to check that up.

Thanks,

-- Peter Xu
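
The failure mode described above (a set-state call returning -EINVAL that the caller ignores) can be modelled in a few lines of C. This is only a simplified sketch: `model_set_xsave()` and `model_put_registers()` are hypothetical names, not the kernel's or QEMU's functions, and the mask check stands in for whatever validation the destination kernel actually performs. The feature bits follow the 0x7 (FP|SSE|YMM) value quoted in the thread.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* XSAVE feature bits as named in the thread: 0x7 == FP|SSE|YMM. */
#define XFEATURE_MASK_FP  (1ULL << 0)
#define XFEATURE_MASK_SSE (1ULL << 1)
#define XFEATURE_MASK_YMM (1ULL << 2)

/*
 * Hypothetical, simplified model of the destination-side check:
 * reject incoming state that enables any feature outside the mask
 * the guest is allowed to use (user_xfeatures).
 */
static int model_set_xsave(uint64_t incoming_xfeatures, uint64_t user_xfeatures)
{
	if (incoming_xfeatures & ~user_xfeatures)
		return -EINVAL;
	return 0;
}

/*
 * Hypothetical userspace caller that propagates the error instead of
 * dropping it, so the migration can fail early rather than resume a
 * guest with inconsistent FPU state.
 */
static int model_put_registers(uint64_t incoming_xfeatures, uint64_t user_xfeatures)
{
	int ret = model_set_xsave(incoming_xfeatures, user_xfeatures);

	if (ret < 0)
		return ret; /* caller should abort the migration here */
	return 0;
}
```

With the values from the test case in the thread, `model_put_registers(0x7, 0x0)` returns -EINVAL, while a matching pair such as `(0x7, 0x7)` returns 0.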

Sean Christopherson

6:47 p.m.

New subject: [PATCH AUTOSEL 5.16 07/28] x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0

On Tue, Jun 07, 2022, Peter Xu wrote:

...

On Tue, Jun 07, 2022 at 03:04:27PM +0000, Sean Christopherson wrote:

...
On Tue, Jun 07, 2022, Paolo Bonzini wrote:

...
On 6/6/22 23:27, Peter Xu wrote:

...
On Mon, Jun 06, 2022 at 06:18:12PM +0200, Paolo Bonzini wrote:

...
...
However there seems to be something missing at least to me, on why it'll fail a migration from 5.15 (without this patch) to 5.18 (with this patch). In my test case, user_xfeatures will be 0x7 (FP|SSE|YMM) if without this patch, but 0x0 if with it.

What CPU model are you using for the VM?

I didn't specify it, assuming it's qemu64 with no extra parameters.

Ok, so indeed it lacks AVX and this patch can have an effect.

...
...
For example, if the source lacks this patch but the destination has it, the source will transmit YMM registers, but the destination will fail to set them if they are not available for the selected CPU model.

See the commit message: "As a bonus, it will also fail if userspace tries to set fpu features (with the KVM_SET_XSAVE ioctl) that are not compatible to the guest configuration. Such features will never be returned by KVM_GET_XSAVE or KVM_GET_XSAVE2."

IIUC you meant we should have failed KVM_SET_XSAVE when they're not aligned (probably by failing validate_user_xstate_header when checking against the user_xfeatures on dest host). But that's probably not my case, because here KVM_SET_XSAVE succeeded, it's just that the guest gets a double fault after the precopy migration completes (or for postcopy when the switchover is done).

Difficult to say what's happening without seeing at least the guest code around the double fault (above you said "fail a migration" and I thought that was a different scenario than the double fault), and possibly which was the first exception that contributed to the double fault.

Regardless of why the guest explodes in the way it does, is someone planning on bisecting this (if necessary?) and sending a backport to v5.15? There's another bug report that is more than likely hitting the same bug.

What's the bisection you mentioned? I actually did a bisection and I also checked reverting Leo's change can also fix this issue. Or do you mean something else?

Oooooh, sorry! I got completely turned around. You ran into a bug with the fix. I thought that you were hitting the same issues as Mike where migrating between hosts with different capabilities is broken in v5.15, but works in v5.18.

...

...
https://lore.kernel.org/all/48353e0d-e771-8a97-21d4-c65ff3bc4192@sentex.net

That is kvm64, and I agree it could be the same problem since both qemu64 and kvm64 models do not have any xsave feature bit declared in cpuid 0xd, so potentially we could be migrating some fpu states to it even with user_xfeatures==0 on dest host.

So today I continued the investigation, and I think what's really missing is that QEMU seems to ignore the user_xfeatures check for KVM_SET_XSAVE and continues even if it returns -EINVAL. IOW, I'm wondering whether we should fail properly and start checking the kvm_arch_put_registers() retcode. That would be a QEMU fix, though; it would at least stop causing random faults (e.g. double faults) in the guest, and we would fail the migration gracefully instead.

Sean: a side note is that I can also easily trigger one WARN_ON_ONCE() in your commit 98c25ead5eda5 in kvm_arch_vcpu_ioctl_run():

WARN_ON_ONCE(kvm_lapic_hv_timer_in_use(vcpu));

It'll be great if you'd like to check that up.

Ugh, userspace can force KVM_MP_STATE_UNINITIALIZED via KVM_SET_MP_STATE. Looks like QEMU does that when emulating RESET.

Logically, a full RESET of the xAPIC seems like the right thing to do. I think we can get away with that without breaking ABI? And kvm_lapic_reset() has a related bug where it stops the HR timer but doesn't handle the HV timer :-/

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index e69b83708f05..948aba894245 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2395,7 +2395,7 @@ void kvm_lapic_reset(struct kvm_vcpu *vcpu, bool init_event)
 		return;

 	/* Stop the timer in case it's a reset to an active apic */
-	hrtimer_cancel(&apic->lapic_timer.timer);
+	cancel_apic_timer(&apic->lapic_timer.timer);

 	/* The xAPIC ID is set at RESET even if the APIC was already enabled. */
 	if (!init_event)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 540651cd28d7..ed2c7cb1642d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10912,6 +10912,9 @@ int kvm_arch_vcpu_ioctl_set_mpstate(struct kvm_vcpu *vcpu,
 	     mp_state->mp_state == KVM_MP_STATE_INIT_RECEIVED))
 		goto out;

+	if (mp_state->mp_state == KVM_MP_STATE_UNINITIALIZED)
+		kvm_lapic_reset(vcpu, false);
+
 	if (mp_state->mp_state == KVM_MP_STATE_SIPI_RECEIVED) {
 		vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
 		set_bit(KVM_APIC_SIPI, &vcpu->arch.apic->pending_events);

Peter Xu

9:01 p.m.

New subject: [PATCH AUTOSEL 5.16 07/28] x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0

On Tue, Jun 07, 2022 at 06:47:41PM +0000, Sean Christopherson wrote:

...

On Tue, Jun 07, 2022, Peter Xu wrote:

...
On Tue, Jun 07, 2022 at 03:04:27PM +0000, Sean Christopherson wrote:

...
On Tue, Jun 07, 2022, Paolo Bonzini wrote:

...
On 6/6/22 23:27, Peter Xu wrote:

...
On Mon, Jun 06, 2022 at 06:18:12PM +0200, Paolo Bonzini wrote:

...
However there seems to be something missing at least to me, on why it'll fail a migration from 5.15 (without this patch) to 5.18 (with this patch). In my test case, user_xfeatures will be 0x7 (FP|SSE|YMM) if without this patch, but 0x0 if with it.

What CPU model are you using for the VM?

I didn't specify it, assuming it's qemu64 with no extra parameters.

Ok, so indeed it lacks AVX and this patch can have an effect.

...
...
For example, if the source lacks this patch but the destination has it, the source will transmit YMM registers, but the destination will fail to set them if they are not available for the selected CPU model.

See the commit message: "As a bonus, it will also fail if userspace tries to set fpu features (with the KVM_SET_XSAVE ioctl) that are not compatible to the guest configuration. Such features will never be returned by KVM_GET_XSAVE or KVM_GET_XSAVE2."

IIUC you meant we should have failed KVM_SET_XSAVE when they're not aligned (probably by failing validate_user_xstate_header when checking against the user_xfeatures on dest host). But that's probably not my case, because here KVM_SET_XSAVE succeeded, it's just that the guest gets a double fault after the precopy migration completes (or for postcopy when the switchover is done).

Difficult to say what's happening without seeing at least the guest code around the double fault (above you said "fail a migration" and I thought that was a different scenario than the double fault), and possibly which was the first exception that contributed to the double fault.

Regardless of why the guest explodes in the way it does, is someone planning on bisecting this (if necessary?) and sending a backport to v5.15? There's another bug report that is more than likely hitting the same bug.

What's the bisection you mentioned? I actually did a bisection and I also checked reverting Leo's change can also fix this issue. Or do you mean something else?

Oooooh, sorry! I got completely turned around. You ran into a bug with the fix. I thought that you were hitting the same issues as Mike where migrating between hosts with different capabilities is broken in v5.15, but works in v5.18.

Aha, no worry.

...

...
...
https://lore.kernel.org/all/48353e0d-e771-8a97-21d4-c65ff3bc4192@sentex.net

That is kvm64, and I agree it could be the same problem since both qemu64 and kvm64 models do not have any xsave feature bit declared in cpuid 0xd, so potentially we could be migrating some fpu states to it even with user_xfeatures==0 on dest host.

So today I continued the investigation, and I think what's really missing is that QEMU seems to ignore the user_xfeatures check for KVM_SET_XSAVE and continues even if it returns -EINVAL. IOW, I'm wondering whether we should fail properly and start checking the kvm_arch_put_registers() retcode. That would be a QEMU fix, though; it would at least stop causing random faults (e.g. double faults) in the guest, and we would fail the migration gracefully instead.

Sean: a side note is that I can also easily trigger one WARN_ON_ONCE() in your commit 98c25ead5eda5 in kvm_arch_vcpu_ioctl_run():

WARN_ON_ONCE(kvm_lapic_hv_timer_in_use(vcpu));

It'll be great if you'd like to check that up.

Ugh, userspace can force KVM_MP_STATE_UNINITIALIZED via KVM_SET_MP_STATE. Looks like QEMU does that when emulating RESET.

Logically, a full RESET of the xAPIC seems like the right thing to do. I think we can get away with that without breaking ABI? And kvm_lapic_reset() has a related bug where it stops the HR timer but doesn't handle the HV timer :-/

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index e69b83708f05..948aba894245 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2395,7 +2395,7 @@ void kvm_lapic_reset(struct kvm_vcpu *vcpu, bool init_event)
 		return;

 	/* Stop the timer in case it's a reset to an active apic */
-	hrtimer_cancel(&apic->lapic_timer.timer);
+	cancel_apic_timer(&apic->lapic_timer.timer);

Needs to be:

+ cancel_apic_timer(apic);

...

 	/* The xAPIC ID is set at RESET even if the APIC was already enabled. */
 	if (!init_event)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 540651cd28d7..ed2c7cb1642d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10912,6 +10912,9 @@ int kvm_arch_vcpu_ioctl_set_mpstate(struct kvm_vcpu *vcpu,
 	     mp_state->mp_state == KVM_MP_STATE_INIT_RECEIVED))
 		goto out;

+	if (mp_state->mp_state == KVM_MP_STATE_UNINITIALIZED)
+		kvm_lapic_reset(vcpu, false);
+
 	if (mp_state->mp_state == KVM_MP_STATE_SIPI_RECEIVED) {
 		vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
 		set_bit(KVM_APIC_SIPI, &vcpu->arch.apic->pending_events);

The change looks reasonable, but sadly I did a quick run and it still triggers.. :-/ So there seems to be something else missing.

-- Peter Xu

Peter Xu

6:55 p.m.

New subject: [PATCH AUTOSEL 5.16 07/28] x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0

On Tue, Jun 07, 2022 at 02:17:54PM -0400, Peter Xu wrote:

...

On Tue, Jun 07, 2022 at 03:04:27PM +0000, Sean Christopherson wrote:

...
On Tue, Jun 07, 2022, Paolo Bonzini wrote:

...
On 6/6/22 23:27, Peter Xu wrote:

...
On Mon, Jun 06, 2022 at 06:18:12PM +0200, Paolo Bonzini wrote:

...
...
However there seems to be something missing at least to me, on why it'll fail a migration from 5.15 (without this patch) to 5.18 (with this patch). In my test case, user_xfeatures will be 0x7 (FP|SSE|YMM) if without this patch, but 0x0 if with it.

What CPU model are you using for the VM?

I didn't specify it, assuming it's qemu64 with no extra parameters.

Ok, so indeed it lacks AVX and this patch can have an effect.

...
...
For example, if the source lacks this patch but the destination has it, the source will transmit YMM registers, but the destination will fail to set them if they are not available for the selected CPU model.

See the commit message: "As a bonus, it will also fail if userspace tries to set fpu features (with the KVM_SET_XSAVE ioctl) that are not compatible to the guest configuration. Such features will never be returned by KVM_GET_XSAVE or KVM_GET_XSAVE2."

IIUC you meant we should have failed KVM_SET_XSAVE when they're not aligned (probably by failing validate_user_xstate_header when checking against the user_xfeatures on dest host). But that's probably not my case, because here KVM_SET_XSAVE succeeded, it's just that the guest gets a double fault after the precopy migration completes (or for postcopy when the switchover is done).

Difficult to say what's happening without seeing at least the guest code around the double fault (above you said "fail a migration" and I thought that was a different scenario than the double fault), and possibly which was the first exception that contributed to the double fault.

Regardless of why the guest explodes in the way it does, is someone planning on bisecting this (if necessary?) and sending a backport to v5.15? There's another bug report that is more than likely hitting the same bug.

What's the bisection you mentioned? I actually did a bisection and I also checked reverting Leo's change can also fix this issue. Or do you mean something else?

Ah, I forgot to mention the "stable tree decisions": IIUC it also means we should apply Leo's patch to all the stable trees if possible, so that migrations between them won't trigger the mysterious faults anymore, including when migrating to the latest Linux versions.

However, there's the dilemma that other kernels (any kernel that does not have Leo's patch) will start to fail migrations to the stable branches that apply Leo's patch too. So that's kind of a slight pity. It's just that, IIUC, the stable trees are more important, because they should have a broader audience (most Linux distros)?

...

...
https://lore.kernel.org/all/48353e0d-e771-8a97-21d4-c65ff3bc4192@sentex.net

That is kvm64, and I agree it could be the same problem since both qemu64 and kvm64 models do not have any xsave feature bit declared in cpuid 0xd, so potentially we could be migrating some fpu states to it even with user_xfeatures==0 on dest host.

So today I continued the investigation, and I think what's really missing is that QEMU seems to ignore the user_xfeatures check for KVM_SET_XSAVE and continues even if it returns -EINVAL. IOW, I'm wondering whether we should fail properly and start checking the kvm_arch_put_registers() retcode. That would be a QEMU fix, though; it would at least stop causing random faults (e.g. double faults) in the guest, and we would fail the migration gracefully instead.

Sean: a side note is that I can also easily trigger one WARN_ON_ONCE() in your commit 98c25ead5eda5 in kvm_arch_vcpu_ioctl_run():

WARN_ON_ONCE(kvm_lapic_hv_timer_in_use(vcpu));

It'll be great if you'd like to check that up.

Thanks,

-- Peter Xu

-- Peter Xu

Leonardo Bras Soares Passos

8 Jun

8:34 p.m.

New subject: [PATCH AUTOSEL 5.16 07/28] x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0

Hello Peter,

On Tue, Jun 7, 2022 at 5:07 PM Peter Xu peterx@redhat.com wrote:

...

On Tue, Jun 07, 2022 at 02:17:54PM -0400, Peter Xu wrote:

...
On Tue, Jun 07, 2022 at 03:04:27PM +0000, Sean Christopherson wrote:

...
On Tue, Jun 07, 2022, Paolo Bonzini wrote:

...
On 6/6/22 23:27, Peter Xu wrote:

...
On Mon, Jun 06, 2022 at 06:18:12PM +0200, Paolo Bonzini wrote:

...
However there seems to be something missing at least to me, on why it'll fail a migration from 5.15 (without this patch) to 5.18 (with this patch). In my test case, user_xfeatures will be 0x7 (FP|SSE|YMM) if without this patch, but 0x0 if with it.

What CPU model are you using for the VM?

I didn't specify it, assuming it's qemu64 with no extra parameters.

Ok, so indeed it lacks AVX and this patch can have an effect.

...
...
For example, if the source lacks this patch but the destination has it, the source will transmit YMM registers, but the destination will fail to set them if they are not available for the selected CPU model.

See the commit message: "As a bonus, it will also fail if userspace tries to set fpu features (with the KVM_SET_XSAVE ioctl) that are not compatible to the guest configuration. Such features will never be returned by KVM_GET_XSAVE or KVM_GET_XSAVE2."

IIUC you meant we should have failed KVM_SET_XSAVE when they're not aligned (probably by failing validate_user_xstate_header when checking against the user_xfeatures on dest host). But that's probably not my case, because here KVM_SET_XSAVE succeeded, it's just that the guest gets a double fault after the precopy migration completes (or for postcopy when the switchover is done).

Difficult to say what's happening without seeing at least the guest code around the double fault (above you said "fail a migration" and I thought that was a different scenario than the double fault), and possibly which was the first exception that contributed to the double fault.

Regardless of why the guest explodes in the way it does, is someone planning on bisecting this (if necessary?) and sending a backport to v5.15? There's another bug report that is more than likely hitting the same bug.

What's the bisection you mentioned? I actually did a bisection and I also checked reverting Leo's change can also fix this issue. Or do you mean something else?

Ah, I forgot to mention the "stable tree decisions": IIUC it also means we should apply Leo's patch to all the stable trees if possible, so that migrations between them won't trigger the mysterious faults anymore, including when migrating to the latest Linux versions.

However, there's the dilemma that other kernels (any kernel that does not have Leo's patch) will start to fail migrations to the stable branches that apply Leo's patch too.

IIUC, you commented before that the migration issue should be solved with a QEMU fix, is that correct? That would mean something like 'QEMU is relying on a kernel bug to work', and should be no blocker for fixing the kernel.

If that's the case, I think we should apply the fix to every supported stable branch that has the PKRU issue, and in parallel come up with a QEMU fix for that.

What do you think about it?

Best regards, Leo

...

So that's kind of a slight pity. It's just IIUC the stable trees are more important, because it should have a broader audience (most Linux distros)?

...
...
https://lore.kernel.org/all/48353e0d-e771-8a97-21d4-c65ff3bc4192@sentex.net

That is kvm64, and I agree it could be the same problem since both qemu64 and kvm64 models do not have any xsave feature bit declared in cpuid 0xd, so potentially we could be migrating some fpu states to it even with user_xfeatures==0 on dest host.

So today I continued the investigation, and I think what's really missing is that QEMU seems to ignore the user_xfeatures check for KVM_SET_XSAVE and continues even if it returns -EINVAL. IOW, I'm wondering whether we should fail properly and start checking the kvm_arch_put_registers() retcode. That would be a QEMU fix, though; it would at least stop causing random faults (e.g. double faults) in the guest, and we would fail the migration gracefully instead.

Sean: a side note is that I can also easily trigger one WARN_ON_ONCE() in your commit 98c25ead5eda5 in kvm_arch_vcpu_ioctl_run():
  WARN_ON_ONCE(kvm_lapic_hv_timer_in_use(vcpu));
It'll be great if you'd like to check that up.

Thanks,

-- Peter Xu
-- Peter Xu

Peter Xu

8:53 p.m.

New subject: [PATCH AUTOSEL 5.16 07/28] x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0

On Wed, Jun 08, 2022 at 05:34:18PM -0300, Leonardo Bras Soares Passos wrote:

...

Hello Peter,

On Tue, Jun 7, 2022 at 5:07 PM Peter Xu peterx@redhat.com wrote:

...
On Tue, Jun 07, 2022 at 02:17:54PM -0400, Peter Xu wrote:

...
On Tue, Jun 07, 2022 at 03:04:27PM +0000, Sean Christopherson wrote:

...
On Tue, Jun 07, 2022, Paolo Bonzini wrote:

...
On 6/6/22 23:27, Peter Xu wrote:

...
On Mon, Jun 06, 2022 at 06:18:12PM +0200, Paolo Bonzini wrote:

However there seems to be something missing at least to me, on why it'll fail a migration from 5.15 (without this patch) to 5.18 (with this patch). In my test case, user_xfeatures will be 0x7 (FP|SSE|YMM) if without this patch, but 0x0 if with it.

What CPU model are you using for the VM?

I didn't specify it, assuming it's qemu64 with no extra parameters.

Ok, so indeed it lacks AVX and this patch can have an effect.

...
For example, if the source lacks this patch but the destination has it, the source will transmit YMM registers, but the destination will fail to set them if they are not available for the selected CPU model.

See the commit message: "As a bonus, it will also fail if userspace tries to set fpu features (with the KVM_SET_XSAVE ioctl) that are not compatible to the guest configuration. Such features will never be returned by KVM_GET_XSAVE or KVM_GET_XSAVE2."

IIUC you meant we should have failed KVM_SET_XSAVE when they're not aligned (probably by failing validate_user_xstate_header when checking against the user_xfeatures on dest host). But that's probably not my case, because here KVM_SET_XSAVE succeeded, it's just that the guest gets a double fault after the precopy migration completes (or for postcopy when the switchover is done).

Difficult to say what's happening without seeing at least the guest code around the double fault (above you said "fail a migration" and I thought that was a different scenario than the double fault), and possibly which was the first exception that contributed to the double fault.

Regardless of why the guest explodes in the way it does, is someone planning on bisecting this (if necessary?) and sending a backport to v5.15? There's another bug report that is more than likely hitting the same bug.

What's the bisection you mentioned? I actually did a bisection and I also checked reverting Leo's change can also fix this issue. Or do you mean something else?

Ah, I forgot to mention the "stable tree decisions": IIUC it also means we should apply Leo's patch to all the stable trees if possible, so that migrations between them won't trigger the mysterious faults anymore, including when migrating to the latest Linux versions.

However, there's the dilemma that other kernels (any kernel that does not have Leo's patch) will start to fail migrations to the stable branches that apply Leo's patch too.

IIUC, you commented before that the migration issue should be solved with a QEMU fix, is that correct? That would mean something like 'QEMU is relying on a kernel bug to work', and should be no blocker for fixing the kernel.

The QEMU fix (that I posted [1]) is not a real fix, only the kernel fix is.

The QEMU patchset only allows the migration to fail early; the kernel patch allows the migration to go through with no problem as long as both sides have the fix applied (or neither does). So there are two issues we're tackling, and IMHO we should fix both.

[1] https://lore.kernel.org/qemu-devel/20220607230645.53950-1-peterx@redhat.com/

...

If that's the case, I think we should apply the fix to every supported stable branch that has the PKRU issue, and in parallel come up with a QEMU fix for that.

What do you think about it?

Yes, I mostly agree with you. I think your patch still does the right thing by not migrating anything the guest doesn't even support, and that seems to be the only way to fix the PKRU-like issue on migrations between hosts with different processor configurations. But it'll also bring other unwanted side effects, which is why IMHO we need some careful thought, and I hope I didn't miss anything important.

Thanks,

-- Peter Xu

Sasha Levin

1 Mar

8:13 p.m.

New subject: [PATCH AUTOSEL 5.16 08/28] spi: rockchip: Fix error in getting num-cs property

From: Jon Lin jon.lin@rock-chips.com

[ Upstream commit 9382df0a98aad5bbcd4d634790305a1d786ad224 ]

Get num-cs u32 from dts of_node property rather than u16.

Signed-off-by: Jon Lin jon.lin@rock-chips.com
Link: https://lore.kernel.org/r/20220216014028.8123-2-jon.lin@rock-chips.com
Signed-off-by: Mark Brown broonie@kernel.org
Signed-off-by: Sasha Levin sashal@kernel.org
---
 drivers/spi/spi-rockchip.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/spi/spi-rockchip.c b/drivers/spi/spi-rockchip.c
index 553b6b9d02222..4f65ba3dd19c2 100644
--- a/drivers/spi/spi-rockchip.c
+++ b/drivers/spi/spi-rockchip.c
@@ -654,7 +654,7 @@ static int rockchip_spi_probe(struct platform_device *pdev)
 	struct spi_controller *ctlr;
 	struct resource *mem;
 	struct device_node *np = pdev->dev.of_node;
-	u32 rsd_nsecs;
+	u32 rsd_nsecs, num_cs;
 	bool slave_mode;

 	slave_mode = of_property_read_bool(np, "spi-slave");
@@ -764,8 +764,9 @@ static int rockchip_spi_probe(struct platform_device *pdev)
 		 * rk spi0 has two native cs, spi1..5 one cs only
 		 * if num-cs is missing in the dts, default to 1
 		 */
-		if (of_property_read_u16(np, "num-cs", &ctlr->num_chipselect))
-			ctlr->num_chipselect = 1;
+		if (of_property_read_u32(np, "num-cs", &num_cs))
+			num_cs = 1;
+		ctlr->num_chipselect = num_cs;
 		ctlr->use_gpio_descriptors = true;
 	}
 	ctlr->dev.of_node = pdev->dev.of_node;

-- 2.34.1
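
As a rough illustration of why the property width matters, the sketch below decodes a `num-cs = <2>` value in userspace. It assumes the usual device tree encoding, where such a value is stored as one 32-bit big-endian cell; `read_be16()`, `read_be32()` and `num_cs_from_cell()` are hypothetical helpers written for this example, not kernel APIs. A 16-bit read of the same bytes sees only the upper half of the cell.

```c
#include <assert.h>
#include <stdint.h>

/* Decode a big-endian 16-bit value from the first two bytes of a cell. */
static uint16_t read_be16(const uint8_t *p)
{
	return (uint16_t)(((unsigned int)p[0] << 8) | p[1]);
}

/* Decode a full 32-bit big-endian device tree cell. */
static uint32_t read_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Hypothetical helper mirroring the shape of the patched logic: read the
 * whole 32-bit cell into a temporary, default to 1 when the property is
 * absent, then narrow to the 16-bit chip-select count.
 */
static uint16_t num_cs_from_cell(const uint8_t *cell, int present)
{
	uint32_t num_cs = present ? read_be32(cell) : 1;

	return (uint16_t)num_cs;
}
```

For the cell bytes `{0x00, 0x00, 0x00, 0x02}`, `read_be16()` yields 0 while `read_be32()` yields 2, which is the mismatch the patch avoids by reading a u32 and assigning it afterwards.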

Sasha Levin

8:13 p.m.

New subject: [PATCH AUTOSEL 5.16 09/28] spi: rockchip: terminate dma transmission when slave abort

From: Jon Lin jon.lin@rock-chips.com

[ Upstream commit 80808768e41324d2e23de89972b5406c1020e6e4 ]

After slave abort, all DMA should be stopped, or it will affect the next transmission and maybe abort again.

Signed-off-by: Jon Lin jon.lin@rock-chips.com
Link: https://lore.kernel.org/r/20220216014028.8123-3-jon.lin@rock-chips.com
Signed-off-by: Mark Brown broonie@kernel.org
Signed-off-by: Sasha Levin sashal@kernel.org
---
 drivers/spi/spi-rockchip.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/spi/spi-rockchip.c b/drivers/spi/spi-rockchip.c
index 4f65ba3dd19c2..c6a1bb09be056 100644
--- a/drivers/spi/spi-rockchip.c
+++ b/drivers/spi/spi-rockchip.c
@@ -585,6 +585,12 @@ static int rockchip_spi_slave_abort(struct spi_controller *ctlr)
 {
 	struct rockchip_spi *rs = spi_controller_get_devdata(ctlr);

+	if (atomic_read(&rs->state) & RXDMA)
+		dmaengine_terminate_sync(ctlr->dma_rx);
+	if (atomic_read(&rs->state) & TXDMA)
+		dmaengine_terminate_sync(ctlr->dma_tx);
+	atomic_set(&rs->state, 0);
+	spi_enable_chip(rs, false);
 	rs->slave_abort = true;
 	spi_finalize_current_transfer(ctlr);

-- 2.34.1
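
The abort sequence in this patch (check which DMA directions are in flight, terminate those, then clear the state) can be sketched as a userspace model using C11 atomics. Everything here is an assumption made for illustration: `struct model_spi`, `model_slave_abort()` and the bit values of `RXDMA`/`TXDMA` are hypothetical, and the counters merely stand in for the real `dmaengine_terminate_sync()` calls.

```c
#include <assert.h>
#include <stdatomic.h>

/* Assumed bit assignments for the in-flight DMA directions. */
#define RXDMA (1 << 0)
#define TXDMA (1 << 1)

/* Hypothetical stand-in for the driver's per-controller state. */
struct model_spi {
	atomic_int state;   /* bitmask of DMA directions still in flight */
	int rx_terminated;  /* counts simulated RX channel terminations */
	int tx_terminated;  /* counts simulated TX channel terminations */
	int slave_abort;    /* set once the abort has been recorded */
};

/*
 * Model of the abort path: stop only the channels that are active,
 * then reset the state so the next transfer starts from a clean slate.
 */
static void model_slave_abort(struct model_spi *rs)
{
	int state = atomic_load(&rs->state);

	if (state & RXDMA)
		rs->rx_terminated++;
	if (state & TXDMA)
		rs->tx_terminated++;
	atomic_store(&rs->state, 0);
	rs->slave_abort = 1;
}
```

Calling `model_slave_abort()` with only `RXDMA` set terminates the RX side alone and leaves `state` at 0, so a following transfer does not inherit stale in-flight bits.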

Sasha Levin

8:13 p.m.

New subject: [PATCH AUTOSEL 5.16 10/28] drm/vc4: hdmi: Unregister codec device on unbind

From: Maxime Ripard maxime@cerno.tech

[ Upstream commit e40945ab7c7f966d0c37b7bd7b0596497dfe228d ]

On bind we will register the HDMI codec device but we don't unregister it on unbind, leading to a device leakage. Unregister our device at unbind.

Signed-off-by: Maxime Ripard maxime@cerno.tech
Reviewed-by: Javier Martinez Canillas javierm@redhat.com
Link: https://patchwork.freedesktop.org/patch/msgid/20220127111452.222002-1-maxime...
Signed-off-by: Sasha Levin sashal@kernel.org
---
 drivers/gpu/drm/vc4/vc4_hdmi.c | 8 ++++++++
 drivers/gpu/drm/vc4/vc4_hdmi.h | 1 +
 2 files changed, 9 insertions(+)

diff --git a/drivers/gpu/drm/vc4/vc4_hdmi.c b/drivers/gpu/drm/vc4/vc4_hdmi.c
index 24f11c07bc3c7..2f53ba54b81ac 100644
--- a/drivers/gpu/drm/vc4/vc4_hdmi.c
+++ b/drivers/gpu/drm/vc4/vc4_hdmi.c
@@ -1522,6 +1522,7 @@ static int vc4_hdmi_audio_init(struct vc4_hdmi *vc4_hdmi)
 		dev_err(dev, "Couldn't register the HDMI codec: %ld\n", PTR_ERR(codec_pdev));
 		return PTR_ERR(codec_pdev);
 	}
+	vc4_hdmi->audio.codec_pdev = codec_pdev;

 	dai_link->cpus = &vc4_hdmi->audio.cpu;
 	dai_link->codecs = &vc4_hdmi->audio.codec;
@@ -1561,6 +1562,12 @@ static int vc4_hdmi_audio_init(struct vc4_hdmi *vc4_hdmi)

 }

+static void vc4_hdmi_audio_exit(struct vc4_hdmi *vc4_hdmi)
+{
+	platform_device_unregister(vc4_hdmi->audio.codec_pdev);
+	vc4_hdmi->audio.codec_pdev = NULL;
+}
+
 static irqreturn_t vc4_hdmi_hpd_irq_thread(int irq, void *priv)
 {
 	struct vc4_hdmi *vc4_hdmi = priv;
@@ -2299,6 +2306,7 @@ static void vc4_hdmi_unbind(struct device *dev, struct device *master,
 	kfree(vc4_hdmi->hdmi_regset.regs);
 	kfree(vc4_hdmi->hd_regset.regs);

+	vc4_hdmi_audio_exit(vc4_hdmi);
 	vc4_hdmi_cec_exit(vc4_hdmi);
 	vc4_hdmi_hotplug_exit(vc4_hdmi);
 	vc4_hdmi_connector_destroy(&vc4_hdmi->connector);
diff --git a/drivers/gpu/drm/vc4/vc4_hdmi.h b/drivers/gpu/drm/vc4/vc4_hdmi.h
index 33e9f665ab8e4..c0492da736833 100644
--- a/drivers/gpu/drm/vc4/vc4_hdmi.h
+++ b/drivers/gpu/drm/vc4/vc4_hdmi.h
@@ -113,6 +113,7 @@ struct vc4_hdmi_audio {
 	struct snd_soc_dai_link_component platform;
 	struct snd_dmaengine_dai_dma_data dma_data;
 	struct hdmi_audio_infoframe infoframe;
+	struct platform_device *codec_pdev;
 	bool streaming;
 };

-- 2.34.1

Sasha Levin

8:13 p.m.

New subject: [PATCH AUTOSEL 5.16 11/28] of/fdt: move elfcorehdr reservation early for crash dump kernel

From: Nikhil Gupta nikhil.gupta@nxp.com

[ Upstream commit 132507ed04ce0c5559be04dd378fec4f3bbc00e8 ]

elfcorehdr_addr is a fixed address passed to the second kernel, and it may conflict with reserved memory in the second kernel. Calling fdt_reserve_elfcorehdr() ahead of fdt_init_reserved_mem() relieves this situation.

Signed-off-by: Nikhil Gupta nikhil.gupta@nxp.com Signed-off-by: Rob Herring robh@kernel.org Link: https://lore.kernel.org/r/20220128042321.15228-1-nikhil.gupta@nxp.com Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/of/fdt.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/of/fdt.c b/drivers/of/fdt.c index 7e868e5995b7e..f66abb496ed16 100644 --- a/drivers/of/fdt.c +++ b/drivers/of/fdt.c @@ -644,8 +644,8 @@ void __init early_init_fdt_scan_reserved_mem(void) }

fdt_scan_reserved_mem(); - fdt_init_reserved_mem(); fdt_reserve_elfcorehdr(); + fdt_init_reserved_mem(); }

/**

-- 2.34.1

Sasha Levin

8:13 p.m.

New subject: [PATCH AUTOSEL 5.16 12/28] x86/kvm: Don't use pv tlb/ipi/sched_yield if on 1 vCPU

From: Wanpeng Li wanpengli@tencent.com

[ Upstream commit ec756e40e271866f951d77c5e923d8deb6002b15 ]

Inspired by commit 3553ae5690a (x86/kvm: Don't use pvqspinlock code if only 1 vCPU), on a VM with only 1 vCPU, there is no need to enable pv tlb/ipi/sched_yield and we can save the memory for __pv_cpu_mask.

Signed-off-by: Wanpeng Li wanpengli@tencent.com Message-Id: 1645171838-2855-1-git-send-email-wanpengli@tencent.com Signed-off-by: Paolo Bonzini pbonzini@redhat.com Signed-off-by: Sasha Levin sashal@kernel.org --- arch/x86/kernel/kvm.c | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 59abbdad7729c..ff3db164e52cb 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -462,19 +462,22 @@ static bool pv_tlb_flush_supported(void) { return (kvm_para_has_feature(KVM_FEATURE_PV_TLB_FLUSH) && !kvm_para_has_hint(KVM_HINTS_REALTIME) && - kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)); + kvm_para_has_feature(KVM_FEATURE_STEAL_TIME) && + (num_possible_cpus() != 1)); }

static bool pv_ipi_supported(void) { - return kvm_para_has_feature(KVM_FEATURE_PV_SEND_IPI); + return (kvm_para_has_feature(KVM_FEATURE_PV_SEND_IPI) && + (num_possible_cpus() != 1)); }

static bool pv_sched_yield_supported(void) { return (kvm_para_has_feature(KVM_FEATURE_PV_SCHED_YIELD) && !kvm_para_has_hint(KVM_HINTS_REALTIME) && - kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)); + kvm_para_has_feature(KVM_FEATURE_STEAL_TIME) && + (num_possible_cpus() != 1)); }

#define KVM_IPI_CLUSTER_SIZE (2 * BITS_PER_LONG)

-- 2.34.1

Paolo Bonzini

8:17 p.m.

New subject: [PATCH AUTOSEL 5.16 12/28] x86/kvm: Don't use pv tlb/ipi/sched_yield if on 1 vCPU

On 3/1/22 21:13, Sasha Levin wrote:

...

From: Wanpeng Li wanpengli@tencent.com

[ Upstream commit ec756e40e271866f951d77c5e923d8deb6002b15 ]

Inspired by commit 3553ae5690a (x86/kvm: Don't use pvqspinlock code if only 1 vCPU), on a VM with only 1 vCPU, there is no need to enable pv tlb/ipi/sched_yield and we can save the memory for __pv_cpu_mask.

Signed-off-by: Wanpeng Li wanpengli@tencent.com Message-Id: 1645171838-2855-1-git-send-email-wanpengli@tencent.com Signed-off-by: Paolo Bonzini pbonzini@redhat.com Signed-off-by: Sasha Levin sashal@kernel.org

arch/x86/kernel/kvm.c | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 59abbdad7729c..ff3db164e52cb 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -462,19 +462,22 @@ static bool pv_tlb_flush_supported(void) { return (kvm_para_has_feature(KVM_FEATURE_PV_TLB_FLUSH) && !kvm_para_has_hint(KVM_HINTS_REALTIME) && - kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)); + kvm_para_has_feature(KVM_FEATURE_STEAL_TIME) && + (num_possible_cpus() != 1)); }

static bool pv_ipi_supported(void) { - return kvm_para_has_feature(KVM_FEATURE_PV_SEND_IPI); + return (kvm_para_has_feature(KVM_FEATURE_PV_SEND_IPI) && + (num_possible_cpus() != 1)); }

static bool pv_sched_yield_supported(void) { return (kvm_para_has_feature(KVM_FEATURE_PV_SCHED_YIELD) && !kvm_para_has_hint(KVM_HINTS_REALTIME) && - kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)); + kvm_para_has_feature(KVM_FEATURE_STEAL_TIME) && + (num_possible_cpus() != 1)); }

#define KVM_IPI_CLUSTER_SIZE (2 * BITS_PER_LONG)

NACK

Not really necessary.

Paolo

Sasha Levin

8:13 p.m.

New subject: [PATCH AUTOSEL 5.16 13/28] drivers: hamradio: 6pack: fix UAF bug caused by mod_timer()

From: Duoming Zhou duoming@zju.edu.cn

[ Upstream commit efe4186e6a1b54bf38b9e05450d43b0da1fd7739 ]

When a 6pack device is detaching, sixpack_close() cleans up the necessary resources. Although del_timer_sync() in sixpack_close() won't return while a timer is active, mod_timer() in sp_xmit_on_air() can re-arm the timer afterwards when userspace issues a syscall such as ax25_sendmsg(), ax25_connect() or ax25_ioctl().

The unexpectedly woken handler, sp_xmit_on_air(), is unaware of the ongoing cleanup and may still call pty_write(), using driver-layer resources that have already been released.

One of the possible race conditions is shown below:

The corresponding fail log is shown below:
===============================================================
BUG: KASAN: use-after-free in __run_timers.part.0+0x170/0x470
Write of size 8 at addr ffff88800a652ab8 by task swapper/2/0
...
Call Trace:
...
queue_work_on+0x3f/0x50
pty_write+0xcd/0xe0
sp_xmit_on_air+0xb2/0x1f0
call_timer_fn+0x28/0x150
__run_timers.part.0+0x3c2/0x470
run_timer_softirq+0x3b/0x80
__do_softirq+0xf1/0x380
...

This patch moves del_timer_sync() after unregister_netdev() to avoid UAF bugs. Because unregister_netdev() is well synchronized, it flushes out any pending queues, waits for the refcount of the net_device to drop to zero, and removes the net_device from the kernel. No routines are still running after unregister_netdev() has executed, so the timer can no longer be re-armed from userspace.

Signed-off-by: Duoming Zhou duoming@zju.edu.cn Reviewed-by: Lin Ma linma@zju.edu.cn Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/hamradio/6pack.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/hamradio/6pack.c b/drivers/net/hamradio/6pack.c index 8a19a06b505d1..ff2bb3d80fac8 100644 --- a/drivers/net/hamradio/6pack.c +++ b/drivers/net/hamradio/6pack.c @@ -668,11 +668,11 @@ static void sixpack_close(struct tty_struct *tty) */ netif_stop_queue(sp->dev);

+ unregister_netdev(sp->dev); + del_timer_sync(&sp->tx_t); del_timer_sync(&sp->resync_t);

- unregister_netdev(sp->dev); - /* Free all 6pack frame buffers after unreg. */ kfree(sp->rbuff); kfree(sp->xbuff);

-- 2.34.1

Sasha Levin

8:13 p.m.

New subject: [PATCH AUTOSEL 5.16 14/28] net-sysfs: add check for netdevice being present to speed_show

From: suresh kumar suresh2514@gmail.com

[ Upstream commit 4224cfd7fb6523f7a9d1c8bb91bb5df1e38eb624 ]

When bringing down the netdevice or during system shutdown, a panic can be triggered while accessing the sysfs path because the device has already been removed.

[ 755.549084] mlx5_core 0000:12:00.1: Shutdown was called [ 756.404455] mlx5_core 0000:12:00.0: Shutdown was called ... [ 757.937260] BUG: unable to handle kernel NULL pointer dereference at (null) [ 758.031397] IP: [<ffffffff8ee11acb>] dma_pool_alloc+0x1ab/0x280

crash> bt ... PID: 12649 TASK: ffff8924108f2100 CPU: 1 COMMAND: "amsd" ... #9 [ffff89240e1a38b0] page_fault at ffffffff8f38c778 [exception RIP: dma_pool_alloc+0x1ab] RIP: ffffffff8ee11acb RSP: ffff89240e1a3968 RFLAGS: 00010046 RAX: 0000000000000246 RBX: ffff89243d874100 RCX: 0000000000001000 RDX: 0000000000000000 RSI: 0000000000000246 RDI: ffff89243d874090 RBP: ffff89240e1a39c0 R8: 000000000001f080 R9: ffff8905ffc03c00 R10: ffffffffc04680d4 R11: ffffffff8edde9fd R12: 00000000000080d0 R13: ffff89243d874090 R14: ffff89243d874080 R15: 0000000000000000 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #10 [ffff89240e1a39c8] mlx5_alloc_cmd_msg at ffffffffc04680f3 [mlx5_core] #11 [ffff89240e1a3a18] cmd_exec at ffffffffc046ad62 [mlx5_core] #12 [ffff89240e1a3ab8] mlx5_cmd_exec at ffffffffc046b4fb [mlx5_core] #13 [ffff89240e1a3ae8] mlx5_core_access_reg at ffffffffc0475434 [mlx5_core] #14 [ffff89240e1a3b40] mlx5e_get_fec_caps at ffffffffc04a7348 [mlx5_core] #15 [ffff89240e1a3bb0] get_fec_supported_advertised at ffffffffc04992bf [mlx5_core] #16 [ffff89240e1a3c08] mlx5e_get_link_ksettings at ffffffffc049ab36 [mlx5_core] #17 [ffff89240e1a3ce8] __ethtool_get_link_ksettings at ffffffff8f25db46 #18 [ffff89240e1a3d48] speed_show at ffffffff8f277208 #19 [ffff89240e1a3dd8] dev_attr_show at ffffffff8f0b70e3 #20 [ffff89240e1a3df8] sysfs_kf_seq_show at ffffffff8eedbedf #21 [ffff89240e1a3e18] kernfs_seq_show at ffffffff8eeda596 #22 [ffff89240e1a3e28] seq_read at ffffffff8ee76d10 #23 [ffff89240e1a3e98] kernfs_fop_read at ffffffff8eedaef5 #24 [ffff89240e1a3ed8] vfs_read at ffffffff8ee4e3ff #25 [ffff89240e1a3f08] sys_read at ffffffff8ee4f27f #26 [ffff89240e1a3f50] system_call_fastpath at ffffffff8f395f92

crash> net_device.state ffff89443b0c0000 state = 0x5 (__LINK_STATE_START| __LINK_STATE_NOCARRIER)

To prevent this scenario, we also make sure that the netdevice is present.

Signed-off-by: suresh kumar suresh2514@gmail.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org --- net/core/net-sysfs.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c index d7f9ee830d34c..9e5657f632453 100644 --- a/net/core/net-sysfs.c +++ b/net/core/net-sysfs.c @@ -213,7 +213,7 @@ static ssize_t speed_show(struct device *dev, if (!rtnl_trylock()) return restart_syscall();

- if (netif_running(netdev)) { + if (netif_running(netdev) && netif_device_present(netdev)) { struct ethtool_link_ksettings cmd;

if (!__ethtool_get_link_ksettings(netdev, &cmd))

-- 2.34.1
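
The state dump in the commit message (state = 0x5) can be decoded by hand to see why the extra check helps. The following is a minimal user-space sketch, not kernel code: the bit positions are assumed to follow the order of the kernel's netdev state enum (START = 0, PRESENT = 1, NOCARRIER = 2), and the helper names state_bit() and may_query_speed() are hypothetical.

```c
#include <stdbool.h>

/* Assumed bit positions, mirroring the order of the kernel's netdev state enum. */
enum {
	LINK_STATE_START     = 0,	/* tested by netif_running() */
	LINK_STATE_PRESENT   = 1,	/* tested by netif_device_present() */
	LINK_STATE_NOCARRIER = 2,
};

/* Hypothetical helper: test one bit of a state word. */
static bool state_bit(unsigned long state, int bit)
{
	return (state >> bit) & 1UL;
}

/* Mirrors the patched condition: running AND present. */
static bool may_query_speed(unsigned long state)
{
	return state_bit(state, LINK_STATE_START) &&
	       state_bit(state, LINK_STATE_PRESENT);
}
```

With state = 0x5, bits 0 and 2 are set (START and NOCARRIER) but bit 1 (PRESENT) is clear, so may_query_speed(0x5) is false, while the old netif_running()-only test would have passed.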

Sasha Levin

8:13 p.m.

New subject: [PATCH AUTOSEL 5.16 15/28] sr9700: sanity check for packet length

From: Oliver Neukum oneukum@suse.com

[ Upstream commit e9da0b56fe27206b49f39805f7dcda8a89379062 ]

A malicious device can leak heap data to user space by providing bogus frame lengths. Introduce a sanity check.

Signed-off-by: Oliver Neukum oneukum@suse.com Reviewed-by: Grant Grundler grundler@chromium.org Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/usb/sr9700.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/usb/sr9700.c b/drivers/net/usb/sr9700.c index b658510cc9a42..5a53e63d33a60 100644 --- a/drivers/net/usb/sr9700.c +++ b/drivers/net/usb/sr9700.c @@ -413,7 +413,7 @@ static int sr9700_rx_fixup(struct usbnet *dev, struct sk_buff *skb) /* ignore the CRC length */ len = (skb->data[1] | (skb->data[2] << 8)) - 4;

- if (len > ETH_FRAME_LEN) + if (len > ETH_FRAME_LEN || len > skb->len) return 0;

/* the last packet of current skb */

-- 2.34.1
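
The idea behind the check can be illustrated outside the kernel. This is a minimal sketch under assumptions, not the driver code: rx_payload_len() is a hypothetical helper, the header layout (length in bytes 1 and 2, little-endian, including a 4-byte CRC) is taken from the hunk above, and the explicit negative-length test is an addition for the sketch only (in the driver the comparison against the unsigned skb->len is what rejects a negative value).

```c
#include <stddef.h>
#include <stdint.h>

#define ETH_FRAME_LEN 1514	/* max Ethernet frame without FCS, as in the kernel */

/*
 * Hypothetical helper: return the payload length claimed by the device,
 * or -1 if the claim cannot be trusted for a buffer of buf_len bytes.
 */
static int rx_payload_len(const uint8_t *buf, size_t buf_len)
{
	int len;

	if (buf_len < 3)
		return -1;	/* header itself is incomplete */

	/* ignore the CRC length, as the driver does */
	len = (buf[1] | (buf[2] << 8)) - 4;

	/* reject lengths that exceed a frame or the data actually received */
	if (len < 0 || len > ETH_FRAME_LEN || (size_t)len > buf_len)
		return -1;

	return len;
}
```

A device that reports 1500 bytes while only 100 were received is rejected, because copying 1496 bytes out of a 100-byte buffer would read adjacent heap memory.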

Sasha Levin

8:13 p.m.

New subject: [PATCH AUTOSEL 5.16 16/28] hwmon: (pmbus) Clear pmbus fault/warning bits after read

From: Vikash Chandola vikash.chandola@linux.intel.com

[ Upstream commit 35f165f08950a876f1b95a61d79c93678fba2fd6 ]

Almost all fault/warning bits in pmbus status registers remain set even after the fault/warning condition is removed. As per the pmbus specification, these faults must be cleared by the user. Modify the hwmon behavior to clear a fault/warning bit after fetching data if that bit was set. This allows the next read to return fresh data.

Signed-off-by: Vikash Chandola vikash.chandola@linux.intel.com Link: https://lore.kernel.org/r/20220222131253.2426834-1-vikash.chandola@linux.int... Signed-off-by: Guenter Roeck linux@roeck-us.net Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/hwmon/pmbus/pmbus_core.c | 5 +++++ 1 file changed, 5 insertions(+)

diff --git a/drivers/hwmon/pmbus/pmbus_core.c b/drivers/hwmon/pmbus/pmbus_core.c index 776ee2237be20..ac2fbee1ba9c0 100644 --- a/drivers/hwmon/pmbus/pmbus_core.c +++ b/drivers/hwmon/pmbus/pmbus_core.c @@ -911,6 +911,11 @@ static int pmbus_get_boolean(struct i2c_client *client, struct pmbus_boolean *b, pmbus_update_sensor_data(client, s2);

regval = status & mask; + if (regval) { + ret = pmbus_write_byte_data(client, page, reg, regval); + if (ret) + goto unlock; + } if (s1 && s2) { s64 v1, v2;

-- 2.34.1

Sasha Levin

8:13 p.m.

New subject: [PATCH AUTOSEL 5.16 17/28] nvme-tcp: send H2CData PDUs based on MAXH2CDATA

From: Varun Prakash varun@chelsio.com

[ Upstream commit c2700d2886a87f83f31e0a301de1d2350b52c79b ]

As per NVMe/TCP specification (revision 1.0a, section 3.6.2.3) Maximum Host to Controller Data length (MAXH2CDATA): Specifies the maximum number of PDU-Data bytes per H2CData PDU in bytes. This value is a multiple of dwords and should be no less than 4,096.

The current code sets the H2CData PDU data_length to r2t_length without checking the MAXH2CDATA value. Fix this by setting the H2CData PDU data_length to min(req->h2cdata_left, queue->maxh2cdata).

Also validate the MAXH2CDATA value returned by the target in the ICResp PDU: if it is not a multiple of a dword or is less than 4096, return -EINVAL from nvme_tcp_init_connection().

Signed-off-by: Varun Prakash varun@chelsio.com Reviewed-by: Sagi Grimberg sagi@grimberg.me Signed-off-by: Christoph Hellwig hch@lst.de Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/nvme/host/tcp.c | 63 +++++++++++++++++++++++++++++++--------- include/linux/nvme-tcp.h | 1 + 2 files changed, 50 insertions(+), 14 deletions(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c index 891a36d02e7c7..65e00c64a588b 100644 --- a/drivers/nvme/host/tcp.c +++ b/drivers/nvme/host/tcp.c @@ -44,6 +44,8 @@ struct nvme_tcp_request { u32 data_len; u32 pdu_len; u32 pdu_sent; + u32 h2cdata_left; + u32 h2cdata_offset; u16 ttag; __le16 status; struct list_head entry; @@ -95,6 +97,7 @@ struct nvme_tcp_queue { struct nvme_tcp_request *request;

int queue_size; + u32 maxh2cdata; size_t cmnd_capsule_len; struct nvme_tcp_ctrl *ctrl; unsigned long flags; @@ -572,23 +575,26 @@ static int nvme_tcp_handle_comp(struct nvme_tcp_queue *queue, return ret; }

-static void nvme_tcp_setup_h2c_data_pdu(struct nvme_tcp_request *req, - struct nvme_tcp_r2t_pdu *pdu) +static void nvme_tcp_setup_h2c_data_pdu(struct nvme_tcp_request *req) { struct nvme_tcp_data_pdu *data = req->pdu; struct nvme_tcp_queue *queue = req->queue; struct request *rq = blk_mq_rq_from_pdu(req); + u32 h2cdata_sent = req->pdu_len; u8 hdgst = nvme_tcp_hdgst_len(queue); u8 ddgst = nvme_tcp_ddgst_len(queue);

req->state = NVME_TCP_SEND_H2C_PDU; req->offset = 0; - req->pdu_len = le32_to_cpu(pdu->r2t_length); + req->pdu_len = min(req->h2cdata_left, queue->maxh2cdata); req->pdu_sent = 0; + req->h2cdata_left -= req->pdu_len; + req->h2cdata_offset += h2cdata_sent;

memset(data, 0, sizeof(*data)); data->hdr.type = nvme_tcp_h2c_data; - data->hdr.flags = NVME_TCP_F_DATA_LAST; + if (!req->h2cdata_left) + data->hdr.flags = NVME_TCP_F_DATA_LAST; if (queue->hdr_digest) data->hdr.flags |= NVME_TCP_F_HDGST; if (queue->data_digest) @@ -597,9 +603,9 @@ static void nvme_tcp_setup_h2c_data_pdu(struct nvme_tcp_request *req, data->hdr.pdo = data->hdr.hlen + hdgst; data->hdr.plen = cpu_to_le32(data->hdr.hlen + hdgst + req->pdu_len + ddgst); - data->ttag = pdu->ttag; + data->ttag = req->ttag; data->command_id = nvme_cid(rq); - data->data_offset = pdu->r2t_offset; + data->data_offset = cpu_to_le32(req->h2cdata_offset); data->data_length = cpu_to_le32(req->pdu_len); }

@@ -609,6 +615,7 @@ static int nvme_tcp_handle_r2t(struct nvme_tcp_queue *queue, struct nvme_tcp_request *req; struct request *rq; u32 r2t_length = le32_to_cpu(pdu->r2t_length); + u32 r2t_offset = le32_to_cpu(pdu->r2t_offset);

rq = nvme_find_rq(nvme_tcp_tagset(queue), pdu->command_id); if (!rq) { @@ -633,14 +640,19 @@ static int nvme_tcp_handle_r2t(struct nvme_tcp_queue *queue, return -EPROTO; }

- if (unlikely(le32_to_cpu(pdu->r2t_offset) < req->data_sent)) { + if (unlikely(r2t_offset < req->data_sent)) { dev_err(queue->ctrl->ctrl.device, "req %d unexpected r2t offset %u (expected %zu)\n", - rq->tag, le32_to_cpu(pdu->r2t_offset), req->data_sent); + rq->tag, r2t_offset, req->data_sent); return -EPROTO; }

- nvme_tcp_setup_h2c_data_pdu(req, pdu); + req->pdu_len = 0; + req->h2cdata_left = r2t_length; + req->h2cdata_offset = r2t_offset; + req->ttag = pdu->ttag; + + nvme_tcp_setup_h2c_data_pdu(req); nvme_tcp_queue_request(req, false, true);

return 0; @@ -928,6 +940,7 @@ static int nvme_tcp_try_send_data(struct nvme_tcp_request *req) { struct nvme_tcp_queue *queue = req->queue; int req_data_len = req->data_len; + u32 h2cdata_left = req->h2cdata_left;

while (true) { struct page *page = nvme_tcp_req_cur_page(req); @@ -972,7 +985,10 @@ static int nvme_tcp_try_send_data(struct nvme_tcp_request *req) req->state = NVME_TCP_SEND_DDGST; req->offset = 0; } else { - nvme_tcp_done_send_req(queue); + if (h2cdata_left) + nvme_tcp_setup_h2c_data_pdu(req); + else + nvme_tcp_done_send_req(queue); } return 1; } @@ -1030,9 +1046,14 @@ static int nvme_tcp_try_send_data_pdu(struct nvme_tcp_request *req) if (queue->hdr_digest && !req->offset) nvme_tcp_hdgst(queue->snd_hash, pdu, sizeof(*pdu));

- ret = kernel_sendpage(queue->sock, virt_to_page(pdu), - offset_in_page(pdu) + req->offset, len, - MSG_DONTWAIT | MSG_MORE | MSG_SENDPAGE_NOTLAST); + if (!req->h2cdata_left) + ret = kernel_sendpage(queue->sock, virt_to_page(pdu), + offset_in_page(pdu) + req->offset, len, + MSG_DONTWAIT | MSG_MORE | MSG_SENDPAGE_NOTLAST); + else + ret = sock_no_sendpage(queue->sock, virt_to_page(pdu), + offset_in_page(pdu) + req->offset, len, + MSG_DONTWAIT | MSG_MORE); if (unlikely(ret <= 0)) return ret;

@@ -1052,6 +1073,7 @@ static int nvme_tcp_try_send_ddgst(struct nvme_tcp_request *req) { struct nvme_tcp_queue *queue = req->queue; size_t offset = req->offset; + u32 h2cdata_left = req->h2cdata_left; int ret; struct msghdr msg = { .msg_flags = MSG_DONTWAIT }; struct kvec iov = { @@ -1069,7 +1091,10 @@ static int nvme_tcp_try_send_ddgst(struct nvme_tcp_request *req) return ret;

if (offset + ret == NVME_TCP_DIGEST_LENGTH) { - nvme_tcp_done_send_req(queue); + if (h2cdata_left) + nvme_tcp_setup_h2c_data_pdu(req); + else + nvme_tcp_done_send_req(queue); return 1; }

@@ -1261,6 +1286,7 @@ static int nvme_tcp_init_connection(struct nvme_tcp_queue *queue) struct msghdr msg = {}; struct kvec iov; bool ctrl_hdgst, ctrl_ddgst; + u32 maxh2cdata; int ret;

icreq = kzalloc(sizeof(*icreq), GFP_KERNEL); @@ -1344,6 +1370,14 @@ static int nvme_tcp_init_connection(struct nvme_tcp_queue *queue) goto free_icresp; }

+ maxh2cdata = le32_to_cpu(icresp->maxdata); + if ((maxh2cdata % 4) || (maxh2cdata < NVME_TCP_MIN_MAXH2CDATA)) { + pr_err("queue %d: invalid maxh2cdata returned %u\n", + nvme_tcp_queue_id(queue), maxh2cdata); + goto free_icresp; + } + queue->maxh2cdata = maxh2cdata; + ret = 0; free_icresp: kfree(icresp); @@ -2329,6 +2363,7 @@ static blk_status_t nvme_tcp_setup_cmd_pdu(struct nvme_ns *ns, req->data_sent = 0; req->pdu_len = 0; req->pdu_sent = 0; + req->h2cdata_left = 0; req->data_len = blk_rq_nr_phys_segments(rq) ? blk_rq_payload_bytes(rq) : 0; req->curr_bio = rq->bio; diff --git a/include/linux/nvme-tcp.h b/include/linux/nvme-tcp.h index 959e0bd9a913e..75470159a194d 100644 --- a/include/linux/nvme-tcp.h +++ b/include/linux/nvme-tcp.h @@ -12,6 +12,7 @@ #define NVME_TCP_DISC_PORT 8009 #define NVME_TCP_ADMIN_CCSZ SZ_8K #define NVME_TCP_DIGEST_LENGTH 4 +#define NVME_TCP_MIN_MAXH2CDATA 4096

enum nvme_tcp_pfv { NVME_TCP_PFV_1_0 = 0x0,

-- 2.34.1
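
The chunking arithmetic of the patch can be followed in a small user-space model. This is a sketch under assumptions, not the driver: struct h2c_xfer and the functions h2c_start(), h2c_next_pdu() and maxh2cdata_valid() are hypothetical names that mirror the h2cdata_left / h2cdata_offset / pdu_len bookkeeping in the hunks above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MIN_MAXH2CDATA 4096u	/* same value as NVME_TCP_MIN_MAXH2CDATA */

/* ICResp validation: multiple of a dword and at least 4096 bytes. */
static bool maxh2cdata_valid(uint32_t maxh2cdata)
{
	return (maxh2cdata % 4) == 0 && maxh2cdata >= MIN_MAXH2CDATA;
}

/* Hypothetical stand-in for the per-request fields added by the patch. */
struct h2c_xfer {
	uint32_t left;		/* bytes of the R2T still to be sent */
	uint32_t offset;	/* data offset of the current PDU */
	uint32_t pdu_len;	/* data length of the current PDU */
};

/* Called once per R2T, like the setup in nvme_tcp_handle_r2t(). */
static void h2c_start(struct h2c_xfer *x, uint32_t r2t_offset, uint32_t r2t_length)
{
	x->pdu_len = 0;
	x->left = r2t_length;
	x->offset = r2t_offset;
}

/* Prepare the next H2CData PDU; returns true if it is the last one. */
static bool h2c_next_pdu(struct h2c_xfer *x, uint32_t maxh2cdata)
{
	uint32_t sent = x->pdu_len;	/* length of the previous PDU */

	assert(maxh2cdata_valid(maxh2cdata));
	x->pdu_len = x->left < maxh2cdata ? x->left : maxh2cdata;
	x->left -= x->pdu_len;
	x->offset += sent;
	return x->left == 0;		/* DATA_LAST only on the final PDU */
}
```

For an R2T of 10000 bytes and MAXH2CDATA of 4096, this produces three PDUs of 4096, 4096 and 1808 bytes at offsets 0, 4096 and 8192, with only the third marked as last.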

Sasha Levin

8:13 p.m.

New subject: [PATCH AUTOSEL 5.16 18/28] PCI: Mark all AMD Navi10 and Navi14 GPU ATS as broken

From: Alex Deucher alexander.deucher@amd.com

[ Upstream commit 3f1271b54edcc692da5a3663f2aa2a64781f9bc3 ]

There are enough VBIOS escapes without the proper workaround that some users still hit this. Microsoft never productized ATS on Windows so OEM platforms that were Windows-only didn't always validate ATS.

The advantages of ATS are not worth it compared to the potential instabilities on harvested boards. Disable ATS on all Navi10 and Navi14 boards.

Symptoms include:

amdgpu 0000:07:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0007 address=0xffffc02000 flags=0x0000]
AMD-Vi: Event logged [IO_PAGE_FAULT device=07:00.0 domain=0x0007 address=0xffffc02000 flags=0x0000]
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=6047, emitted seq=6049
amdgpu 0000:07:00.0: amdgpu: GPU reset begin!
amdgpu 0000:07:00.0: amdgpu: GPU reset succeeded, trying to resume
amdgpu 0000:07:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring sdma0 test failed (-110)
[drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <sdma_v4_0> failed -110
amdgpu 0000:07:00.0: amdgpu: GPU reset(1) failed

Related commits:

e8946a53e2a6 ("PCI: Mark AMD Navi14 GPU ATS as broken")
a2da5d8cc0b0 ("PCI: Mark AMD Raven iGPU ATS as broken in some platforms")
45beb31d3afb ("PCI: Mark AMD Navi10 GPU rev 0x00 ATS as broken")
5e89cd303e3a ("PCI: Mark AMD Navi14 GPU rev 0xc5 ATS as broken")
d28ca864c493 ("PCI: Mark AMD Stoney Radeon R7 GPU ATS as broken")
9b44b0b09dec ("PCI: Mark AMD Stoney GPU ATS as broken")

[bhelgaas: add symptoms and related commits] Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1760 Link: https://lore.kernel.org/r/20220222160801.841643-1-alexander.deucher@amd.com Signed-off-by: Alex Deucher alexander.deucher@amd.com Signed-off-by: Bjorn Helgaas bhelgaas@google.com Acked-by: Christian König christian.koenig@amd.com Acked-by: Guchun Chen guchun.chen@amd.com Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/pci/quirks.c | 14 +++++++++----- 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c index 20a9326907384..db864bf634a3e 100644 --- a/drivers/pci/quirks.c +++ b/drivers/pci/quirks.c @@ -5344,11 +5344,6 @@ DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_SERVERWORKS, 0x0422, quirk_no_ext_tags); */ static void quirk_amd_harvest_no_ats(struct pci_dev *pdev) { - if ((pdev->device == 0x7312 && pdev->revision != 0x00) || - (pdev->device == 0x7340 && pdev->revision != 0xc5) || - (pdev->device == 0x7341 && pdev->revision != 0x00)) - return; - if (pdev->device == 0x15d8) { if (pdev->revision == 0xcf && pdev->subsystem_vendor == 0xea50 && @@ -5370,10 +5365,19 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x98e4, quirk_amd_harvest_no_ats); /* AMD Iceland dGPU */ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x6900, quirk_amd_harvest_no_ats); /* AMD Navi10 dGPU */ +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x7310, quirk_amd_harvest_no_ats); DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x7312, quirk_amd_harvest_no_ats); +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x7318, quirk_amd_harvest_no_ats); +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x7319, quirk_amd_harvest_no_ats); +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x731a, quirk_amd_harvest_no_ats); +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x731b, quirk_amd_harvest_no_ats); +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x731e, quirk_amd_harvest_no_ats); +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x731f, quirk_amd_harvest_no_ats); /* AMD Navi14 dGPU */ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x7340, quirk_amd_harvest_no_ats); DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x7341, quirk_amd_harvest_no_ats); +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x7347, quirk_amd_harvest_no_ats); +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x734f, quirk_amd_harvest_no_ats); /* AMD Raven platform iGPU */ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x15d8, quirk_amd_harvest_no_ats); #endif /* CONFIG_PCI_ATS */

-- 2.34.1

Sasha Levin

8:13 p.m.

New subject: [PATCH AUTOSEL 5.16 19/28] gpio: Return EPROBE_DEFER if gc->to_irq is NULL

From: Shreeya Patel shreeya.patel@collabora.com

[ Upstream commit ae42f9288846353982e2eab181fb41e7fd8bf60f ]

We are racing the registering of .to_irq when probing the i2c driver. This results in random failure of touchscreen devices.

The following sequence explains the race condition:

[gpio driver] gpio driver registers gpio chip
[gpio consumer] gpio is acquired
[gpio consumer] gpiod_to_irq() fails with -ENXIO
[gpio driver] gpio driver registers irqchip
gpiod_to_irq works at this point, but -ENXIO is fatal

We could see the following errors in dmesg logs when gc->to_irq is NULL

[2.101857] i2c_hid i2c-FTS3528:00: HID over i2c has not been provided an Int IRQ
[2.101953] i2c_hid: probe of i2c-FTS3528:00 failed with error -22

To avoid this situation, defer probing until to_irq is registered. Returning -EPROBE_DEFER is a first step towards avoiding device failures caused by the race in registering .to_irq. The final solution to this issue would be to avoid using the gc irq members until they are fully initialized.

This issue has been reported many times in the past, and people have been using workarounds such as building pinctrl_amd into the kernel instead of loading it as a module, or adding a softdep for pinctrl_amd to the config file.

BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=209413 Reviewed-by: Linus Walleij linus.walleij@linaro.org Reviewed-by: Andy Shevchenko andy.shevchenko@gmail.com Reported-by: kernel test robot lkp@intel.com Signed-off-by: Shreeya Patel shreeya.patel@collabora.com Signed-off-by: Bartosz Golaszewski brgl@bgdev.pl Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/gpio/gpiolib.c | 10 ++++++++++ 1 file changed, 10 insertions(+)

diff --git a/drivers/gpio/gpiolib.c b/drivers/gpio/gpiolib.c index abfbf546d1599..7b3f7f4d1d063 100644 --- a/drivers/gpio/gpiolib.c +++ b/drivers/gpio/gpiolib.c @@ -3111,6 +3111,16 @@ int gpiod_to_irq(const struct gpio_desc *desc)

return retirq; } +#ifdef CONFIG_GPIOLIB_IRQCHIP + if (gc->irq.chip) { + /* + * Avoid race condition with other code, which tries to lookup + * an IRQ before the irqchip has been properly registered, + * i.e. while gpiochip is still being brought up. + */ + return -EPROBE_DEFER; + } +#endif return -ENXIO; } EXPORT_SYMBOL_GPL(gpiod_to_irq);

-- 2.34.1
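
The decision added by this patch can be modelled in user space. This is a simplified sketch, not the gpiolib code: struct fake_chip, fake_to_irq() and fake_gpiod_to_irq() are hypothetical, the irqchip_declared flag stands in for the gc->irq.chip test, and EPROBE_DEFER is defined locally as 517 because it is a kernel-internal errno that user-space headers do not provide.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#ifndef EPROBE_DEFER
#define EPROBE_DEFER 517	/* kernel-internal value, assumed here */
#endif

/* Hypothetical stand-in for the parts of struct gpio_chip that matter. */
struct fake_chip {
	int (*to_irq)(unsigned int offset);	/* NULL until the irqchip is registered */
	bool irqchip_declared;			/* stands in for gc->irq.chip != NULL */
};

/* Hypothetical mapping used once the irqchip is up. */
static int fake_to_irq(unsigned int offset)
{
	return 100 + (int)offset;
}

static int fake_gpiod_to_irq(const struct fake_chip *gc, unsigned int offset)
{
	if (gc->to_irq)
		return gc->to_irq(offset);

	/* An irqchip is declared but not registered yet: ask the caller to retry. */
	if (gc->irqchip_declared)
		return -EPROBE_DEFER;

	/* The chip will never provide IRQs: this failure is final. */
	return -ENXIO;
}
```

A consumer that sees -EPROBE_DEFER has its probe retried later by the driver core, whereas -ENXIO ends the probe, which is the difference between the touchscreen working and the failure shown in the dmesg lines above.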

Sasha Levin

8:13 p.m.

New subject: [PATCH AUTOSEL 5.16 20/28] drm/amdgpu: bypass tiling flag check in virtual display case (v2)

From: Guchun Chen guchun.chen@amd.com

[ Upstream commit e2b993302f40c4eb714ecf896dd9e1c5be7d4cd7 ]

vkms uses the common amdgpu framebuffer creation path and does not support FB modifiers, so there is no need to check tiling flags when initializing a framebuffer while virtual display is enabled.

This fixes the following call trace:

amdgpu 0000:00:08.0: GFX9+ requires FB check based on format modifier
WARNING: CPU: 0 PID: 1023 at drivers/gpu/drm/amd/amdgpu/amdgpu_display.c:1150 amdgpu_display_framebuffer_init+0x8e7/0xb40 [amdgpu]

v2: check adev->enable_virtual_display instead as vkms can be enabled in bare metal as well.

Signed-off-by: Leslie Shi Yuliang.Shi@amd.com Signed-off-by: Guchun Chen guchun.chen@amd.com Reviewed-by: Alex Deucher alexander.deucher@amd.com Signed-off-by: Alex Deucher alexander.deucher@amd.com Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/gpu/drm/amd/amdgpu/amdgpu_display.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c index dc50c05f23fc2..5c08047adb594 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c @@ -1145,7 +1145,7 @@ int amdgpu_display_framebuffer_init(struct drm_device *dev, if (ret) return ret;

- if (!dev->mode_config.allow_fb_modifiers) { + if (!dev->mode_config.allow_fb_modifiers && !adev->enable_virtual_display) { drm_WARN_ONCE(dev, adev->family >= AMDGPU_FAMILY_AI, "GFX9+ requires FB check based on format modifier\n"); ret = check_tiling_flags_gfx6(rfb);

-- 2.34.1

Sasha Levin

8:13 p.m.

New subject: [PATCH AUTOSEL 5.16 21/28] Revert "xen-netback: remove 'hotplug-status' once it has served its purpose"

From: Marek Marczykowski-Górecki marmarek@invisiblethingslab.com

[ Upstream commit 0f4558ae91870692ce7f509c31c9d6ee721d8cdc ]

This reverts commit 1f2565780e9b7218cf92c7630130e82dcc0fe9c2.

The 'hotplug-status' node should not be removed as long as the vif device remains configured. Otherwise xen-netback would wait for the network script to be re-run even if it was already called (in case of the frontend re-connecting). But it _should_ be removed when the vif device is destroyed (for example when unbinding the driver), otherwise the hotplug script would not configure the device whenever it re-appears.

Moving the removal of the 'hotplug-status' node was a workaround for nothing calling the network script after the xen-netback module is reloaded. But when the vif interface is re-created (on xen-netback unbind/bind, for example), the script should be called, regardless of who does that. Currently this case is not handled by the toolstack and requires a manual script call. Keeping hotplug-status=connected to skip the call is wrong and leads to an unconfigured interface.

More discussion at https://lore.kernel.org/xen-devel/afedd7cb-a291-e773-8b0d-4db9b291fa98@ipxe....

Signed-off-by: Marek Marczykowski-Górecki marmarek@invisiblethingslab.com Reviewed-by: Paul Durrant paul@xen.org Link: https://lore.kernel.org/r/20220222001817.2264967-1-marmarek@invisiblethingsl... Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/xen-netback/xenbus.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/xen-netback/xenbus.c b/drivers/net/xen-netback/xenbus.c index d24b7a7993aa0..3fad58d22155b 100644 --- a/drivers/net/xen-netback/xenbus.c +++ b/drivers/net/xen-netback/xenbus.c @@ -256,6 +256,7 @@ static void backend_disconnect(struct backend_info *be) unsigned int queue_index;

xen_unregister_watchers(vif); + xenbus_rm(XBT_NIL, be->dev->nodename, "hotplug-status"); #ifdef CONFIG_DEBUG_FS xenvif_debugfs_delif(vif); #endif /* CONFIG_DEBUG_FS */ @@ -675,7 +676,6 @@ static void hotplug_status_changed(struct xenbus_watch *watch,

/* Not interested in this watch anymore. */ unregister_hotplug_status_watch(be); - xenbus_rm(XBT_NIL, be->dev->nodename, "hotplug-status"); } kfree(str); }

-- 2.34.1

Sasha Levin

8:13 p.m.

New subject: [PATCH AUTOSEL 5.16 22/28] Revert "xen-netback: Check for hotplug-status existence before watching"

From: Marek Marczykowski-Górecki marmarek@invisiblethingslab.com

[ Upstream commit e8240addd0a3919e0fd7436416afe9aa6429c484 ]

This reverts commit 2afeec08ab5c86ae21952151f726bfe184f6b23d.

The reasoning in the commit was wrong: the code expected to set up the watch even if 'hotplug-status' didn't exist. In fact, it relied on the watch being fired the first time, to check whether 'hotplug-status' is already set to 'connected'. Not registering a watch for a non-existing path (which is the case if the hotplug script hasn't been executed yet) made the backend not wait for the hotplug script to execute. This in turn made netfront think the interface is fully operational, while in fact it was not (the vif interface on the xen-netback side might not be configured yet).

This was a workaround for 'hotplug-status' erroneously being removed. But since that is reverted now, the workaround is not necessary either.

More discussion at https://lore.kernel.org/xen-devel/afedd7cb-a291-e773-8b0d-4db9b291fa98@ipxe....

Signed-off-by: Marek Marczykowski-Górecki marmarek@invisiblethingslab.com
Reviewed-by: Paul Durrant paul@xen.org
Reviewed-by: Michael Brown mbrown@fensystems.co.uk
Link: https://lore.kernel.org/r/20220222001817.2264967-2-marmarek@invisiblethingsl...
Signed-off-by: Jakub Kicinski kuba@kernel.org
Signed-off-by: Sasha Levin sashal@kernel.org
---
 drivers/net/xen-netback/xenbus.c | 12 ++++--------
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/drivers/net/xen-netback/xenbus.c b/drivers/net/xen-netback/xenbus.c
index 3fad58d22155b..990360d75cb64 100644
--- a/drivers/net/xen-netback/xenbus.c
+++ b/drivers/net/xen-netback/xenbus.c
@@ -824,15 +824,11 @@ static void connect(struct backend_info *be)
 	xenvif_carrier_on(be->vif);
 
 	unregister_hotplug_status_watch(be);
-	if (xenbus_exists(XBT_NIL, dev->nodename, "hotplug-status")) {
-		err = xenbus_watch_pathfmt(dev, &be->hotplug_status_watch,
-					   NULL, hotplug_status_changed,
-					   "%s/%s", dev->nodename,
-					   "hotplug-status");
-		if (err)
-			goto err;
+	err = xenbus_watch_pathfmt(dev, &be->hotplug_status_watch, NULL,
+				   hotplug_status_changed,
+				   "%s/%s", dev->nodename, "hotplug-status");
+	if (!err)
 		be->have_hotplug_status_watch = 1;
-	}
 
 	netif_tx_wake_all_queues(be->vif->dev);

-- 2.34.1

Sasha Levin

8:13 p.m.

New subject: [PATCH AUTOSEL 5.16 23/28] ipv6: prevent a possible race condition with lifetimes

From: Niels Dossche dossche.niels@gmail.com

[ Upstream commit 6c0d8833a605e195ae219b5042577ce52bf71fff ]

valid_lft, prefered_lft and tstamp are always accessed under the lock "lock" in other places. Reading these without taking the lock may result in inconsistencies regarding the calculation of the valid and preferred variables since decisions are taken on these fields for those variables.

Signed-off-by: Niels Dossche dossche.niels@gmail.com
Reviewed-by: David Ahern dsahern@kernel.org
Signed-off-by: Niels Dossche niels.dossche@ugent.be
Link: https://lore.kernel.org/r/20220223131954.6570-1-niels.dossche@ugent.be
Signed-off-by: Jakub Kicinski kuba@kernel.org
Signed-off-by: Sasha Levin sashal@kernel.org
---
 net/ipv6/addrconf.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 6652d96329a0c..ec9b8de5dd88a 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -4998,6 +4998,7 @@ static int inet6_fill_ifaddr(struct sk_buff *skb, struct inet6_ifaddr *ifa,
 	    nla_put_s32(skb, IFA_TARGET_NETNSID, args->netnsid))
 		goto error;
 
+	spin_lock_bh(&ifa->lock);
 	if (!((ifa->flags&IFA_F_PERMANENT) &&
 	      (ifa->prefered_lft == INFINITY_LIFE_TIME))) {
 		preferred = ifa->prefered_lft;
@@ -5019,6 +5020,7 @@ static int inet6_fill_ifaddr(struct sk_buff *skb, struct inet6_ifaddr *ifa,
 		preferred = INFINITY_LIFE_TIME;
 		valid = INFINITY_LIFE_TIME;
 	}
+	spin_unlock_bh(&ifa->lock);
 
 	if (!ipv6_addr_any(&ifa->peer_addr)) {
 		if (nla_put_in6_addr(skb, IFA_LOCAL, &ifa->addr) < 0 ||

-- 2.34.1

Sasha Levin

8:13 p.m.

New subject: [PATCH AUTOSEL 5.16 24/28] tracing: Ensure trace buffer is at least 4096 bytes large

From: Sven Schnelle svens@linux.ibm.com

[ Upstream commit 7acf3a127bb7c65ff39099afd78960e77b2ca5de ]

Booting the kernel with 'trace_buf_size=1' gives a warning at boot during the ftrace selftests:

  [ 0.892809] Running postponed tracer tests:
  [ 0.892893] Testing tracer function:
  [ 0.901899] Callback from call_rcu_tasks_trace() invoked.
  [ 0.983829] Callback from call_rcu_tasks_rude() invoked.
  [ 1.072003] .. bad ring buffer .. corrupted trace buffer ..
  [ 1.091944] Callback from call_rcu_tasks() invoked.
  [ 1.097695] PASSED
  [ 1.097701] Testing dynamic ftrace: .. filter failed count=0 ..FAILED!
  [ 1.353474] ------------[ cut here ]------------
  [ 1.353478] WARNING: CPU: 0 PID: 1 at kernel/trace/trace.c:1951 run_tracer_selftest+0x13c/0x1b0

Therefore enforce a minimum of 4096 bytes to make the selftest pass.
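
For readers who want to see the clamping behaviour outside the kernel, here is a minimal userspace sketch. It is not the kernel code: parse_size() is a simplified, hypothetical stand-in for memparse(), and MIN_TRACE_BUF_SIZE and clamp_trace_buf_size() are names made up for this illustration. It only shows that a too-small (or zero) request is raised to 4096 bytes while larger requests pass through unchanged.

```c
#include <stdlib.h>

/* Hypothetical name; mirrors the 4096-byte floor named in the commit message. */
#define MIN_TRACE_BUF_SIZE 4096UL

/*
 * parse_size - simplified userspace stand-in for the kernel's memparse().
 * Accepts an optional K, M or G suffix (powers of 1024).
 */
static unsigned long parse_size(const char *str)
{
	char *end;
	unsigned long val = strtoul(str, &end, 0);

	switch (*end) {
	case 'G': case 'g':
		val <<= 10;
		/* fall through */
	case 'M': case 'm':
		val <<= 10;
		/* fall through */
	case 'K': case 'k':
		val <<= 10;
		break;
	default:
		break;
	}
	return val;
}

/* Return the requested size, raised to the minimum if it is too small. */
static unsigned long clamp_trace_buf_size(const char *str)
{
	unsigned long buf_size = parse_size(str);

	return buf_size < MIN_TRACE_BUF_SIZE ? MIN_TRACE_BUF_SIZE : buf_size;
}
```

With this sketch, a request of "1" (the value from the report above) and a request of "0" both end up as 4096, so the separate zero check of the old code is no longer needed.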

Link: https://lkml.kernel.org/r/20220214134456.1751749-1-svens@linux.ibm.com

Signed-off-by: Sven Schnelle svens@linux.ibm.com
Signed-off-by: Steven Rostedt (Google) rostedt@goodmis.org
Signed-off-by: Sasha Levin sashal@kernel.org
---
 kernel/trace/trace.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index bb15059020445..9b5abe11be5c0 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -1472,10 +1472,12 @@ static int __init set_buf_size(char *str)
 	if (!str)
 		return 0;
 	buf_size = memparse(str, &str);
-	/* nr_entries can not be zero */
-	if (buf_size == 0)
-		return 0;
-	trace_buf_size = buf_size;
+	/*
+	 * nr_entries can not be zero and the startup
+	 * tests require some buffer space. Therefore
+	 * ensure we have at least 4096 bytes of buffer.
+	 */
+	trace_buf_size = max(4096UL, buf_size);
 	return 1;
 }
 __setup("trace_buf_size=", set_buf_size);

-- 2.34.1

Sasha Levin

8:13 p.m.

New subject: [PATCH AUTOSEL 5.16 25/28] tracing/osnoise: Make osnoise_main to sleep for microseconds

From: Daniel Bristot de Oliveira bristot@kernel.org

[ Upstream commit dd990352f01ee9a6c6eee152e5d11c021caccfe4 ]

osnoise's runtime and period are in the microseconds scale, but it is currently sleeping in the milliseconds scale. This behavior is rooted in the usage of hwlat as the skeleton for osnoise.

Make osnoise sleep in the microseconds scale. Also, move the sleep to a specialized function.
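
To make the precision loss concrete, here is a small userspace sketch of the two interval calculations. It is an illustration only: sleep_ms_old() and sleep_us_new() are hypothetical helper names, and the sketch assumes period >= runtime. The old path divides the microsecond interval by USEC_PER_MSEC, so any remainder below one millisecond is dropped; the new path keeps the interval in microseconds.

```c
#include <stdint.h>

#define USEC_PER_MSEC 1000ULL

/*
 * Old behaviour (sketch): the interval is converted to milliseconds,
 * so integer division discards anything below 1000 us.
 */
static uint64_t sleep_ms_old(uint64_t period_us, uint64_t runtime_us)
{
	return (period_us - runtime_us) / USEC_PER_MSEC;
}

/* New behaviour (sketch): the interval stays in microseconds. */
static uint64_t sleep_us_new(uint64_t period_us, uint64_t runtime_us)
{
	return period_us - runtime_us;
}
```

For example, with a period of 1000000 us and a runtime of 999500 us, the old calculation yields 0 ms (no sleep at all), while the new one yields the intended 500 us.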

Link: https://lkml.kernel.org/r/302aa6c7bdf2d131719b22901905e9da122a11b2.164519733...

Cc: Ingo Molnar mingo@redhat.com
Signed-off-by: Daniel Bristot de Oliveira bristot@kernel.org
Signed-off-by: Steven Rostedt (Google) rostedt@goodmis.org
Signed-off-by: Sasha Levin sashal@kernel.org
---
 kernel/trace/trace_osnoise.c | 53 ++++++++++++++++++++++--------------
 1 file changed, 32 insertions(+), 21 deletions(-)

diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
index b58674e8644a6..58c788b0ca27c 100644
--- a/kernel/trace/trace_osnoise.c
+++ b/kernel/trace/trace_osnoise.c
@@ -1437,6 +1437,37 @@ static int run_osnoise(void)
 static struct cpumask osnoise_cpumask;
 static struct cpumask save_cpumask;
 
+/*
+ * osnoise_sleep - sleep until the next period
+ */
+static void osnoise_sleep(void)
+{
+	u64 interval;
+	ktime_t wake_time;
+
+	mutex_lock(&interface_lock);
+	interval = osnoise_data.sample_period - osnoise_data.sample_runtime;
+	mutex_unlock(&interface_lock);
+
+	/*
+	 * differently from hwlat_detector, the osnoise tracer can run
+	 * without a pause because preemption is on.
+	 */
+	if (!interval) {
+		/* Let synchronize_rcu_tasks() make progress */
+		cond_resched_tasks_rcu_qs();
+		return;
+	}
+
+	wake_time = ktime_add_us(ktime_get(), interval);
+	__set_current_state(TASK_INTERRUPTIBLE);
+
+	while (schedule_hrtimeout_range(&wake_time, 0, HRTIMER_MODE_ABS)) {
+		if (kthread_should_stop())
+			break;
+	}
+}
+
 /*
  * osnoise_main - The osnoise detection kernel thread
  *
@@ -1445,30 +1476,10 @@ static struct cpumask save_cpumask;
  */
 static int osnoise_main(void *data)
 {
-	u64 interval;
 
 	while (!kthread_should_stop()) {
-
 		run_osnoise();
-
-		mutex_lock(&interface_lock);
-		interval = osnoise_data.sample_period - osnoise_data.sample_runtime;
-		mutex_unlock(&interface_lock);
-
-		do_div(interval, USEC_PER_MSEC);
-
-		/*
-		 * differently from hwlat_detector, the osnoise tracer can run
-		 * without a pause because preemption is on.
-		 */
-		if (interval < 1) {
-			/* Let synchronize_rcu_tasks() make progress */
-			cond_resched_tasks_rcu_qs();
-			continue;
-		}
-
-		if (msleep_interruptible(interval))
-			break;
+		osnoise_sleep();
 	}
 
 	return 0;

-- 2.34.1

Sasha Levin

8:13 p.m.

New subject: [PATCH AUTOSEL 5.16 26/28] tracing: Fix selftest config check for function graph start up test

From: Christophe Leroy christophe.leroy@csgroup.eu

[ Upstream commit c5229a0bd47814770c895e94fbc97ad21819abfe ]

CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS is required to test direct tramp.

Link: https://lkml.kernel.org/r/bdc7e594e13b0891c1d61bc8d56c94b1890eaed7.164001796...

Signed-off-by: Christophe Leroy christophe.leroy@csgroup.eu
Signed-off-by: Steven Rostedt (Google) rostedt@goodmis.org
Signed-off-by: Sasha Levin sashal@kernel.org
---
 kernel/trace/trace_selftest.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/kernel/trace/trace_selftest.c b/kernel/trace/trace_selftest.c
index afd937a46496e..abcadbe933bb7 100644
--- a/kernel/trace/trace_selftest.c
+++ b/kernel/trace/trace_selftest.c
@@ -784,9 +784,7 @@ static struct fgraph_ops fgraph_ops __initdata = {
 	.retfunc = &trace_graph_return,
 };
 
-#if defined(CONFIG_DYNAMIC_FTRACE) && \
-    defined(CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS)
-#define TEST_DIRECT_TRAMP
+#ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
 noinline __noclone static void trace_direct_tramp(void) { }
 #endif
 
@@ -849,7 +847,7 @@ trace_selftest_startup_function_graph(struct tracer *trace,
 		goto out;
 	}
 
-#ifdef TEST_DIRECT_TRAMP
+#ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
 	tracing_reset_online_cpus(&tr->array_buffer);
 	set_graph_array(tr);

-- 2.34.1

Sasha Levin

8:13 p.m.

New subject: [PATCH AUTOSEL 5.16 27/28] selftest/vm: fix map_fixed_noreplace test failure

From: "Aneesh Kumar K.V" aneesh.kumar@linux.ibm.com

[ Upstream commit f39c58008dee7ab5fc94c3f1995a21e886801df0 ]

On the latest RHEL the test fails because the executable is mapped at the 256MB address:

  # ./map_fixed_noreplace
  mmap() @ 0x10000000-0x10050000 p=0xffffffffffffffff result=File exists
  10000000-10010000 r-xp 00000000 fd:04 34905657 /root/rpmbuild/BUILD/kernel-5.14.0-56.el9/linux-5.14.0-56.el9.ppc64le/tools/testing/selftests/vm/map_fixed_noreplace
  10010000-10020000 r--p 00000000 fd:04 34905657 /root/rpmbuild/BUILD/kernel-5.14.0-56.el9/linux-5.14.0-56.el9.ppc64le/tools/testing/selftests/vm/map_fixed_noreplace
  10020000-10030000 rw-p 00010000 fd:04 34905657 /root/rpmbuild/BUILD/kernel-5.14.0-56.el9/linux-5.14.0-56.el9.ppc64le/tools/testing/selftests/vm/map_fixed_noreplace
  10029b90000-10029bc0000 rw-p 00000000 00:00 0 [heap]
  7fffbb510000-7fffbb750000 r-xp 00000000 fd:04 24534 /usr/lib64/libc.so.6
  7fffbb750000-7fffbb760000 r--p 00230000 fd:04 24534 /usr/lib64/libc.so.6
  7fffbb760000-7fffbb770000 rw-p 00240000 fd:04 24534 /usr/lib64/libc.so.6
  7fffbb780000-7fffbb7a0000 r--p 00000000 00:00 0 [vvar]
  7fffbb7a0000-7fffbb7b0000 r-xp 00000000 00:00 0 [vdso]
  7fffbb7b0000-7fffbb800000 r-xp 00000000 fd:04 24514 /usr/lib64/ld64.so.2
  7fffbb800000-7fffbb810000 r--p 00040000 fd:04 24514 /usr/lib64/ld64.so.2
  7fffbb810000-7fffbb820000 rw-p 00050000 fd:04 24514 /usr/lib64/ld64.so.2
  7fffd93f0000-7fffd9420000 rw-p 00000000 00:00 0 [stack]
  Error: couldn't map the space we need for the test

Fix this by finding a free address using mmap instead of hardcoding BASE_ADDRESS.
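
The probing idea can be sketched as a pair of standalone helpers. This is an illustration under assumptions, not the selftest itself: probe_free_range() and map_noreplace() are hypothetical names, and the sketch assumes a single-threaded process so that nothing else maps into the range between the probe and the later fixed mapping. The first helper lets the kernel pick an unused range and immediately releases it; the second maps at that address with MAP_FIXED_NOREPLACE, which on kernels that support the flag fails with EEXIST instead of replacing an existing mapping.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Same fallback definition the selftest uses for older headers. */
#ifndef MAP_FIXED_NOREPLACE
#define MAP_FIXED_NOREPLACE 0x100000
#endif

/*
 * Ask the kernel for any free range of the given size, then unmap it.
 * Returns the start of the (now free) range, or NULL on failure.
 */
static void *probe_free_range(size_t size)
{
	void *addr = mmap(NULL, size, PROT_NONE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (addr == MAP_FAILED)
		return NULL;
	if (munmap(addr, size) != 0)
		return NULL;
	return addr;
}

/* Map exactly at addr, refusing to replace any existing mapping. */
static void *map_noreplace(void *addr, size_t size)
{
	return mmap(addr, size, PROT_NONE,
		    MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE, -1, 0);
}
```

A caller would probe once for the whole test area (five pages in the selftest) and then derive all later fixed addresses from the returned base, which avoids depending on where the executable, heap or libraries happen to be loaded.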

Link: https://lkml.kernel.org/r/20220217083417.373823-1-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V aneesh.kumar@linux.ibm.com
Cc: Michael Ellerman mpe@ellerman.id.au
Cc: Jann Horn jannh@google.com
Cc: Shuah Khan shuah@kernel.org
Signed-off-by: Andrew Morton akpm@linux-foundation.org
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Signed-off-by: Sasha Levin sashal@kernel.org
---
 .../selftests/vm/map_fixed_noreplace.c | 49 ++++++++++++++-----
 1 file changed, 37 insertions(+), 12 deletions(-)

diff --git a/tools/testing/selftests/vm/map_fixed_noreplace.c b/tools/testing/selftests/vm/map_fixed_noreplace.c
index d91bde5112686..eed44322d1a63 100644
--- a/tools/testing/selftests/vm/map_fixed_noreplace.c
+++ b/tools/testing/selftests/vm/map_fixed_noreplace.c
@@ -17,9 +17,6 @@
 #define MAP_FIXED_NOREPLACE 0x100000
 #endif
 
-#define BASE_ADDRESS (256ul * 1024 * 1024)
-
-
 static void dump_maps(void)
 {
 	char cmd[32];
@@ -28,18 +25,46 @@ static void dump_maps(void)
 	system(cmd);
 }
 
+static unsigned long find_base_addr(unsigned long size)
+{
+	void *addr;
+	unsigned long flags;
+
+	flags = MAP_PRIVATE | MAP_ANONYMOUS;
+	addr = mmap(NULL, size, PROT_NONE, flags, -1, 0);
+	if (addr == MAP_FAILED) {
+		printf("Error: couldn't map the space we need for the test\n");
+		return 0;
+	}
+
+	if (munmap(addr, size) != 0) {
+		printf("Error: couldn't map the space we need for the test\n");
+		return 0;
+	}
+	return (unsigned long)addr;
+}
+
 int main(void)
 {
+	unsigned long base_addr;
 	unsigned long flags, addr, size, page_size;
 	char *p;
 
 	page_size = sysconf(_SC_PAGE_SIZE);
 
+	//let's find a base addr that is free before we start the tests
+	size = 5 * page_size;
+	base_addr = find_base_addr(size);
+	if (!base_addr) {
+		printf("Error: couldn't map the space we need for the test\n");
+		return 1;
+	}
+
 	flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE;
 
 	// Check we can map all the areas we need below
 	errno = 0;
-	addr = BASE_ADDRESS;
+	addr = base_addr;
 	size = 5 * page_size;
 	p = mmap((void *)addr, size, PROT_NONE, flags, -1, 0);
 
@@ -60,7 +85,7 @@ int main(void)
 	printf("unmap() successful\n");
 
 	errno = 0;
-	addr = BASE_ADDRESS + page_size;
+	addr = base_addr + page_size;
 	size = 3 * page_size;
 	p = mmap((void *)addr, size, PROT_NONE, flags, -1, 0);
 	printf("mmap() @ 0x%lx-0x%lx p=%p result=%m\n", addr, addr + size, p);
@@ -80,7 +105,7 @@ int main(void)
 	 * +4 | free | new
 	 */
 	errno = 0;
-	addr = BASE_ADDRESS;
+	addr = base_addr;
 	size = 5 * page_size;
 	p = mmap((void *)addr, size, PROT_NONE, flags, -1, 0);
 	printf("mmap() @ 0x%lx-0x%lx p=%p result=%m\n", addr, addr + size, p);
@@ -101,7 +126,7 @@ int main(void)
 	 * +4 | free |
 	 */
 	errno = 0;
-	addr = BASE_ADDRESS + (2 * page_size);
+	addr = base_addr + (2 * page_size);
 	size = page_size;
 	p = mmap((void *)addr, size, PROT_NONE, flags, -1, 0);
 	printf("mmap() @ 0x%lx-0x%lx p=%p result=%m\n", addr, addr + size, p);
@@ -121,7 +146,7 @@ int main(void)
 	 * +4 | free | new
 	 */
 	errno = 0;
-	addr = BASE_ADDRESS + (3 * page_size);
+	addr = base_addr + (3 * page_size);
 	size = 2 * page_size;
 	p = mmap((void *)addr, size, PROT_NONE, flags, -1, 0);
 	printf("mmap() @ 0x%lx-0x%lx p=%p result=%m\n", addr, addr + size, p);
@@ -141,7 +166,7 @@ int main(void)
 	 * +4 | free |
 	 */
 	errno = 0;
-	addr = BASE_ADDRESS;
+	addr = base_addr;
 	size = 2 * page_size;
 	p = mmap((void *)addr, size, PROT_NONE, flags, -1, 0);
 	printf("mmap() @ 0x%lx-0x%lx p=%p result=%m\n", addr, addr + size, p);
@@ -161,7 +186,7 @@ int main(void)
 	 * +4 | free |
 	 */
 	errno = 0;
-	addr = BASE_ADDRESS;
+	addr = base_addr;
 	size = page_size;
 	p = mmap((void *)addr, size, PROT_NONE, flags, -1, 0);
 	printf("mmap() @ 0x%lx-0x%lx p=%p result=%m\n", addr, addr + size, p);
@@ -181,7 +206,7 @@ int main(void)
 	 * +4 | free | new
 	 */
 	errno = 0;
-	addr = BASE_ADDRESS + (4 * page_size);
+	addr = base_addr + (4 * page_size);
 	size = page_size;
 	p = mmap((void *)addr, size, PROT_NONE, flags, -1, 0);
 	printf("mmap() @ 0x%lx-0x%lx p=%p result=%m\n", addr, addr + size, p);
@@ -192,7 +217,7 @@ int main(void)
 		return 1;
 	}
 
-	addr = BASE_ADDRESS;
+	addr = base_addr;
 	size = 5 * page_size;
 	if (munmap((void *)addr, size) != 0) {
 		dump_maps();

-- 2.34.1

Sasha Levin

8:13 p.m.

New subject: [PATCH AUTOSEL 5.16 28/28] selftests/memfd: clean up mapping in mfd_fail_write

From: Mike Kravetz mike.kravetz@oracle.com

[ Upstream commit fda153c89af344d21df281009a9d046cf587ea0f ]

Running the memfd script ./run_hugetlbfs_test.sh will often end in error as follows:

  memfd-hugetlb: CREATE
  memfd-hugetlb: BASIC
  memfd-hugetlb: SEAL-WRITE
  memfd-hugetlb: SEAL-FUTURE-WRITE
  memfd-hugetlb: SEAL-SHRINK
  fallocate(ALLOC) failed: No space left on device
  ./run_hugetlbfs_test.sh: line 60: 166855 Aborted (core dumped) ./memfd_test hugetlbfs
  opening: ./mnt/memfd
  fuse: DONE

If no hugetlb pages have been preallocated, run_hugetlbfs_test.sh will allocate 'just enough' pages to run the test. In the SEAL-FUTURE-WRITE test the mfd_fail_write routine maps the file, but does not unmap it. As a result, two hugetlb pages remain reserved for the mapping. When the fallocate call in the SEAL-SHRINK test attempts to allocate all hugetlb pages, it is short by the two reserved pages.

Fix by making sure to unmap in mfd_fail_write.

Link: https://lkml.kernel.org/r/20220219004340.56478-1-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz mike.kravetz@oracle.com
Cc: Joel Fernandes joel@joelfernandes.org
Cc: Shuah Khan shuah@kernel.org
Signed-off-by: Andrew Morton akpm@linux-foundation.org
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Signed-off-by: Sasha Levin sashal@kernel.org
---
 tools/testing/selftests/memfd/memfd_test.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/testing/selftests/memfd/memfd_test.c b/tools/testing/selftests/memfd/memfd_test.c
index 192a2899bae8f..94df2692e6e4a 100644
--- a/tools/testing/selftests/memfd/memfd_test.c
+++ b/tools/testing/selftests/memfd/memfd_test.c
@@ -455,6 +455,7 @@ static void mfd_fail_write(int fd)
 			printf("mmap()+mprotect() didn't fail as expected\n");
 			abort();
 		}
+		munmap(p, mfd_def_size);
 	}
 
 	/* verify PUNCH_HOLE fails */

-- 2.34.1


linux-stable-mirror@lists.linaro.org
