The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 92597f97a40bf661bebceb92e26ff87c76d562d4 Mon Sep 17 00:00:00 2001
From: "Rafael J. Wysocki" <rafael.j.wysocki(a)intel.com>
Date: Thu, 31 Mar 2022 19:38:51 +0200
Subject: [PATCH] PCI/PM: Avoid putting Elo i2 PCIe Ports in D3cold
If a Root Port on Elo i2 is put into D3cold and then back into D0, the
downstream device becomes permanently inaccessible, so add a bridge D3 DMI
quirk for that system.
This was exposed by 14858dcc3b35 ("PCI: Use pci_update_current_state() in
pci_enable_device_flags()"), but before that commit the Root Port in
question had never been put into D3cold for real due to a mismatch between
its power state retrieved from the PCI_PM_CTRL register (which was
accessible even though the platform firmware indicated that the port was in
D3cold) and the state of an ACPI power resource involved in its power
management.
BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=215715
Link: https://lore.kernel.org/r/11980172.O9o76ZdvQC@kreacher
Reported-by: Stefan Gottwald <gottwald(a)igel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki(a)intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas(a)google.com>
Cc: stable(a)vger.kernel.org # v5.15+
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 9ecce435fb3f..d25122fbe98a 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -2920,6 +2920,16 @@ static const struct dmi_system_id bridge_d3_blacklist[] = {
DMI_MATCH(DMI_BOARD_VENDOR, "Gigabyte Technology Co., Ltd."),
DMI_MATCH(DMI_BOARD_NAME, "X299 DESIGNARE EX-CF"),
},
+ /*
+ * Downstream device is not accessible after putting a root port
+ * into D3cold and back into D0 on Elo i2.
+ */
+ .ident = "Elo i2",
+ .matches = {
+ DMI_MATCH(DMI_SYS_VENDOR, "Elo Touch Solutions"),
+ DMI_MATCH(DMI_PRODUCT_NAME, "Elo i2"),
+ DMI_MATCH(DMI_PRODUCT_VERSION, "RevB"),
+ },
},
#endif
{ }
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 92597f97a40bf661bebceb92e26ff87c76d562d4 Mon Sep 17 00:00:00 2001
From: "Rafael J. Wysocki" <rafael.j.wysocki(a)intel.com>
Date: Thu, 31 Mar 2022 19:38:51 +0200
Subject: [PATCH] PCI/PM: Avoid putting Elo i2 PCIe Ports in D3cold
If a Root Port on Elo i2 is put into D3cold and then back into D0, the
downstream device becomes permanently inaccessible, so add a bridge D3 DMI
quirk for that system.
This was exposed by 14858dcc3b35 ("PCI: Use pci_update_current_state() in
pci_enable_device_flags()"), but before that commit the Root Port in
question had never been put into D3cold for real due to a mismatch between
its power state retrieved from the PCI_PM_CTRL register (which was
accessible even though the platform firmware indicated that the port was in
D3cold) and the state of an ACPI power resource involved in its power
management.
BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=215715
Link: https://lore.kernel.org/r/11980172.O9o76ZdvQC@kreacher
Reported-by: Stefan Gottwald <gottwald(a)igel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki(a)intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas(a)google.com>
Cc: stable(a)vger.kernel.org # v5.15+
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 9ecce435fb3f..d25122fbe98a 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -2920,6 +2920,16 @@ static const struct dmi_system_id bridge_d3_blacklist[] = {
DMI_MATCH(DMI_BOARD_VENDOR, "Gigabyte Technology Co., Ltd."),
DMI_MATCH(DMI_BOARD_NAME, "X299 DESIGNARE EX-CF"),
},
+ /*
+ * Downstream device is not accessible after putting a root port
+ * into D3cold and back into D0 on Elo i2.
+ */
+ .ident = "Elo i2",
+ .matches = {
+ DMI_MATCH(DMI_SYS_VENDOR, "Elo Touch Solutions"),
+ DMI_MATCH(DMI_PRODUCT_NAME, "Elo i2"),
+ DMI_MATCH(DMI_PRODUCT_VERSION, "RevB"),
+ },
},
#endif
{ }
When booting with ACPI unavailable or disabled, get_smp_config() ends up
calling MP_processor_info() for each CPU found in the MPS
table. Previously, this resulted in boot_cpu_physical_apicid getting
unconditionally overwritten by the apicid of whatever processor had the
CPU_BOOTPROCESSOR flag. This occurred even if boot_cpu_physical_apicid
had already been more reliably determined in register_lapic_address() by
calling read_apic_id() from the actual boot processor.
Ordinariliy, this is not a problem because the boot processor really is
the one with the CPU_BOOTPROCESSOR flag. However, kexec is an exception
in which the kernel may be booted from any processor regardless of the
MPS table contents. In this case, boot_cpu_physical_apicid may not
indicate the actual boot processor.
This was particularly problematic when the second kernel was booted with
NR_CPUS fewer than the number of physical processors. It's the job of
generic_processor_info() to decide which CPUs to bring up in this case.
That obviously must include the real boot processor which it takes care
to save a slot for. It relies upon the contents of
boot_cpu_physical_apicid to do this, which if incorrect, may result in
the boot processor getting left out.
This condition can be discovered by smp_sanity_check() and rectified by
adding the boot processor to the phys_cpu_present_map with the warning
"weird, boot CPU (#%d) not listed by the BIOS". However, commit
3e730dad3b6da ("x86/apic: Unify interrupt mode setup for SMP-capable
system") caused setup_local_APIC() to be called before this could happen
resulting in a BUG_ON(!apic->apic_id_registered()):
[ 0.655452] ------------[ cut here ]------------
[ 0.660610] Kernel BUG at setup_local_APIC+0x74/0x280 [verbose debug info unavailable]
[ 0.669466] invalid opcode: 0000 [#1] SMP
[ 0.673948] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.19.109.Ar-16509018.eostrunkkernel419 #1
[ 0.683670] Hardware name: Quanta Quanta LY6 (1LY6UZZ0FBC), BIOS 1.0.6.0-e7d6a55 11/26/2015
[ 0.693007] RIP: 0010:setup_local_APIC+0x74/0x280
[ 0.698264] Code: 80 e4 fe bf f0 00 00 00 89 c6 48 8b 05 0f 1a 8e 00 ff 50 10 e8 12 53 fd ff 48 8b 05 00 1a 8e 00 ff 90 a0 00 00 00 85 c0 75 02 <0f> 0b 48 8b 05 ed 19 8e 00 41 be 00 02 00 00 ff 90 b0 00 00 00 48
[ 0.719251] RSP: 0000:ffffffff81a03e20 EFLAGS: 00010246
[ 0.725091] RAX: 0000000000000000 RBX: 0000000000000003 RCX: 0000000000000000
[ 0.733066] RDX: 0000000000000000 RSI: 000000000000000f RDI: 0000000000000020
[ 0.741041] RBP: ffffffff81a03e98 R08: 0000000000000002 R09: 0000000000000000
[ 0.749014] R10: ffffffff81a204e0 R11: ffffffff81b50ea7 R12: 0000000000000000
[ 0.756989] R13: ffffffff81aef920 R14: ffffffff81af60a0 R15: 0000000000000000
[ 0.764965] FS: 0000000000000000(0000) GS:ffff888036800000(0000) knlGS:0000000000000000
[ 0.774007] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 0.780427] CR2: ffff888035c01000 CR3: 0000000035a08000 CR4: 00000000000006b0
[ 0.788401] Call Trace:
[ 0.791137] ? amd_iommu_prepare+0x15/0x2a
[ 0.795717] apic_bsp_setup+0x55/0x75
[ 0.799808] apic_intr_mode_init+0x169/0x16e
[ 0.804579] x86_late_time_init+0x10/0x17
[ 0.809062] start_kernel+0x37e/0x3fe
[ 0.813154] x86_64_start_reservations+0x2a/0x2c
[ 0.818316] x86_64_start_kernel+0x72/0x75
[ 0.822886] secondary_startup_64+0xa4/0xb0
[ 0.827564] ---[ end trace 237b64da0fd9b22e ]---
This change avoids these issues by only setting boot_cpu_physical_apicid
from the MPS table if it is not already set, which can occur in the
construct_default_ISA_mptable() path. Otherwise,
boot_cpu_physical_apicid will already have been set in
register_lapic_address() and should therefore remain untouched.
Looking through all the places where boot_cpu_physical_apicid is
accessed, nearly all of them assume that boot_cpu_physical_apicid should
match read_apic_id() on the booting processor. The only place that might
intend to use the BSP apicid listed in the MPS table is amd_numa_init(),
which explicitly requires boot_cpu_physical_apicid to be the lowest
apicid of all processors. Ironically, due to the early exit short
circuit in early_get_smp_config(), it instead gets
boot_cpu_physical_apicid = read_apic_id() rather than the MPS table
BSP. The behaviour of amd_numa_init() is therefore unaffected by this
change.
Fixes: 3e730dad3b6da ("x86/apic: Unify interrupt mode setup for SMP-capable system")
Signed-off-by: Kevin Mitchell <kevmitch(a)arista.com>
Cc: <stable(a)vger.kernel.org>
---
arch/x86/kernel/mpparse.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/mpparse.c b/arch/x86/kernel/mpparse.c
index afac7ccce72f..6f22f09bfe11 100644
--- a/arch/x86/kernel/mpparse.c
+++ b/arch/x86/kernel/mpparse.c
@@ -64,7 +64,8 @@ static void __init MP_processor_info(struct mpc_cpu *m)
if (m->cpuflag & CPU_BOOTPROCESSOR) {
bootup_cpu = " (Bootup-CPU)";
- boot_cpu_physical_apicid = m->apicid;
+ if (boot_cpu_physical_apicid == -1U)
+ boot_cpu_physical_apicid = m->apicid;
}
pr_info("Processor #%d%s\n", m->apicid, bootup_cpu);
--
2.26.2
Hello
I noticed today, that commit eccac77ede394 ("clk: imx7d: Remove
audio_mclk_root_clk") was backported to stable kernels. This commit by
itself does break audio functionality on i.MX 7 based boards. [1]
For this to work commit 4cb7df64c73 ("ARM: dts: imx7: Use
audio_mclk_post_div instead audio_mclk_root_clk") is also needed but is
lacking a Fixes: tag, hence it is missing in stable kernels. [2]
I noticed this while merging stable-patches into our downstream-
branches.
@Abel can you confirm my finding?
@Greg could you pull patch [2] into kernels where [1] got merged?
Thanks,
Philippe
[1]
https://lore.kernel.org/all/20220127141052.1900174-2-abel.vesa@nxp.com/
[2]
https://lore.kernel.org/all/20220127141052.1900174-1-abel.vesa@nxp.com/
This is a note to let you know that I've just added the patch titled
USB: new quirk for Dell Gen 2 devices
to my usb git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb.git
in the usb-next branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week.)
The patch will also be merged in the next major kernel release
during the merge window.
If you have any questions about this process, please let me know.
From 97fa5887cf283bb75ffff5f6b2c0e71794c02400 Mon Sep 17 00:00:00 2001
From: Monish Kumar R <monish.kumar.r(a)intel.com>
Date: Fri, 20 May 2022 18:30:44 +0530
Subject: USB: new quirk for Dell Gen 2 devices
Add USB_QUIRK_NO_LPM and USB_QUIRK_RESET_RESUME quirks for Dell usb gen
2 device to not fail during enumeration.
Found this bug on own testing
Signed-off-by: Monish Kumar R <monish.kumar.r(a)intel.com>
Cc: stable <stable(a)vger.kernel.org>
Link: https://lore.kernel.org/r/20220520130044.17303-1-monish.kumar.r@intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/usb/core/quirks.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/usb/core/quirks.c b/drivers/usb/core/quirks.c
index 97b44a68668a..f99a65a64588 100644
--- a/drivers/usb/core/quirks.c
+++ b/drivers/usb/core/quirks.c
@@ -510,6 +510,9 @@ static const struct usb_device_id usb_quirk_list[] = {
/* DJI CineSSD */
{ USB_DEVICE(0x2ca3, 0x0031), .driver_info = USB_QUIRK_NO_LPM },
+ /* DELL USB GEN2 */
+ { USB_DEVICE(0x413c, 0xb062), .driver_info = USB_QUIRK_NO_LPM | USB_QUIRK_RESET_RESUME },
+
/* VCOM device */
{ USB_DEVICE(0x4296, 0x7570), .driver_info = USB_QUIRK_CONFIG_INTF_STRINGS },
--
2.36.1
This is a note to let you know that I've just added the patch titled
USB: new quirk for Dell Gen 2 devices
to my usb git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb.git
in the usb-testing branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week.)
The patch will be merged to the usb-next branch sometime soon,
after it passes testing, and the merge window is open.
If you have any questions about this process, please let me know.
From 97fa5887cf283bb75ffff5f6b2c0e71794c02400 Mon Sep 17 00:00:00 2001
From: Monish Kumar R <monish.kumar.r(a)intel.com>
Date: Fri, 20 May 2022 18:30:44 +0530
Subject: USB: new quirk for Dell Gen 2 devices
Add USB_QUIRK_NO_LPM and USB_QUIRK_RESET_RESUME quirks for Dell usb gen
2 device to not fail during enumeration.
Found this bug on own testing
Signed-off-by: Monish Kumar R <monish.kumar.r(a)intel.com>
Cc: stable <stable(a)vger.kernel.org>
Link: https://lore.kernel.org/r/20220520130044.17303-1-monish.kumar.r@intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/usb/core/quirks.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/usb/core/quirks.c b/drivers/usb/core/quirks.c
index 97b44a68668a..f99a65a64588 100644
--- a/drivers/usb/core/quirks.c
+++ b/drivers/usb/core/quirks.c
@@ -510,6 +510,9 @@ static const struct usb_device_id usb_quirk_list[] = {
/* DJI CineSSD */
{ USB_DEVICE(0x2ca3, 0x0031), .driver_info = USB_QUIRK_NO_LPM },
+ /* DELL USB GEN2 */
+ { USB_DEVICE(0x413c, 0xb062), .driver_info = USB_QUIRK_NO_LPM | USB_QUIRK_RESET_RESUME },
+
/* VCOM device */
{ USB_DEVICE(0x4296, 0x7570), .driver_info = USB_QUIRK_CONFIG_INTF_STRINGS },
--
2.36.1
When multiplying of different types, an overflow is possible even
when storing the result in a larger type. This is because the
conversion is done after the multiplication. So arithmetic
overflow and thus in incorrect value is possible.
Correct an instance of this in the inter packet delay calculation.
Fix by ensuring one of the operands is u64 which will promote the
other to u64 as well ensuring no overflow.
Fixes: 7724105686e7 ("IB/hfi1: add driver files")
Cc: stable(a)vger.kernel.org
Reviewed-by: Mike Marciniszyn <mike.marciniszyn(a)cornelisnetworks.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro(a)cornelisnetworks.com>
---
drivers/infiniband/hw/hfi1/init.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/infiniband/hw/hfi1/init.c b/drivers/infiniband/hw/hfi1/init.c
index 4436ed4..436372b 100644
--- a/drivers/infiniband/hw/hfi1/init.c
+++ b/drivers/infiniband/hw/hfi1/init.c
@@ -489,7 +489,7 @@ void set_link_ipg(struct hfi1_pportdata *ppd)
u16 shift, mult;
u64 src;
u32 current_egress_rate; /* Mbits /sec */
- u32 max_pkt_time;
+ u64 max_pkt_time;
/*
* max_pkt_time is the maximum packet egress time in units
* of the fabric clock period 1/(805 MHz).
Some GPIO lines have stopped working after the patch
commit 2ab73c6d8323f ("gpio: Support GPIO controllers without pin-ranges")
And this has supposedly been fixed in the following patches
commit 89ad556b7f96a ("gpio: Avoid using pin ranges with !PINCTRL")
commit 6dbbf84603961 ("gpiolib: Don't free if pin ranges are not defined")
But an erratic behavior where some GPIO lines work while others do not work
has been introduced.
This patch reverts those changes so that the sysfs-gpio interface works
properly again.
Signed-off-by: Marcelo Roberto Jimenez <marcelo.jimenez(a)gmail.com>
---
Hi,
My system is ARM926EJ-S rev 5 (v5l) (AT91SAM9G25), the board is an ACME Systems Arietta.
The system used sysfs-gpio to manage a few gpio lines, and I have noticed that some have stopped working.
The test script is very simple:
#! /bin/bash
cd /sys/class/gpio/
echo 24 > export
cd pioA24
echo out > direction
echo 0 > value
cat value
echo 1 > value
cat value
echo 0 > value
cat value
echo 1 > value
cat value
cd ..
echo 24 > unexport
In a "good" kernel, this script outputs 0, 1, 0, 1. In a bad kernel, the output result is 1, 1, 1, 1. Also it must be possible to run this script twice without errors, that was the issue with the gpiochip_generic_free() call that had been addressed in another patch.
In my system PINCTRL is automatically selected by
SOC_AT91SAM9 [=y] && ARCH_AT91 [=y] && ARCH_MULTI_V5 [=y]
So it is not an option to disable it to make it work.
Best regards,
Marcelo.
drivers/gpio/gpiolib.c | 10 ----------
1 file changed, 10 deletions(-)
diff --git a/drivers/gpio/gpiolib.c b/drivers/gpio/gpiolib.c
index af5bb8fedfea..ac69ec8fb37a 100644
--- a/drivers/gpio/gpiolib.c
+++ b/drivers/gpio/gpiolib.c
@@ -1804,11 +1804,6 @@ static inline void gpiochip_irqchip_free_valid_mask(struct gpio_chip *gc)
*/
int gpiochip_generic_request(struct gpio_chip *gc, unsigned offset)
{
-#ifdef CONFIG_PINCTRL
- if (list_empty(&gc->gpiodev->pin_ranges))
- return 0;
-#endif
-
return pinctrl_gpio_request(gc->gpiodev->base + offset);
}
EXPORT_SYMBOL_GPL(gpiochip_generic_request);
@@ -1820,11 +1815,6 @@ EXPORT_SYMBOL_GPL(gpiochip_generic_request);
*/
void gpiochip_generic_free(struct gpio_chip *gc, unsigned offset)
{
-#ifdef CONFIG_PINCTRL
- if (list_empty(&gc->gpiodev->pin_ranges))
- return;
-#endif
-
pinctrl_gpio_free(gc->gpiodev->base + offset);
}
EXPORT_SYMBOL_GPL(gpiochip_generic_free);
--
2.30.2
On 5/20/2022 10:06 AM, Jan Beulich wrote:
> On 20.05.2022 15:33, Chuck Zmudzinski wrote:
>> On 5/20/2022 5:41 AM, Jan Beulich wrote:
>>> On 20.05.2022 10:30, Chuck Zmudzinski wrote:
>>>> On 5/20/2022 2:59 AM, Chuck Zmudzinski wrote:
>>>>> On 5/20/2022 2:05 AM, Jan Beulich wrote:
>>>>>> On 20.05.2022 06:43, Chuck Zmudzinski wrote:
>>>>>>> On 5/4/22 5:14 AM, Juergen Gross wrote:
>>>>>>>> On 04.05.22 10:31, Jan Beulich wrote:
>>>>>>>>> On 03.05.2022 15:22, Juergen Gross wrote:
>>>>>>>>>
>>>>>>>>> ... these uses there are several more. You say nothing on why
>>>>>>>>> those want
>>>>>>>>> leaving unaltered. When preparing my earlier patch I did inspect them
>>>>>>>>> and came to the conclusion that these all would also better
>>>>>>>>> observe the
>>>>>>>>> adjusted behavior (or else I couldn't have left pat_enabled() as the
>>>>>>>>> only predicate). In fact, as said in the description of my earlier
>>>>>>>>> patch, in
>>>>>>>>> my debugging I did find the use in i915_gem_object_pin_map() to be
>>>>>>>>> the
>>>>>>>>> problematic one, which you leave alone.
>>>>>>>> Oh, I missed that one, sorry.
>>>>>>> That is why your patch would not fix my Haswell unless
>>>>>>> it also touches i915_gem_object_pin_map() in
>>>>>>> drivers/gpu/drm/i915/gem/i915_gem_pages.c
>>>>>>>
>>>>>>>> I wanted to be rather defensive in my changes, but I agree at least
>>>>>>>> the
>>>>>>>> case in arch_phys_wc_add() might want to be changed, too.
>>>>>>> I think your approach needs to be more aggressive so it will fix
>>>>>>> all the known false negatives introduced by bdd8b6c98239
>>>>>>> such as the one in i915_gem_object_pin_map().
>>>>>>>
>>>>>>> I looked at Jan's approach and I think it would fix the issue
>>>>>>> with my Haswell as long as I don't use the nopat option. I
>>>>>>> really don't have a strong opinion on that question, but I
>>>>>>> think the nopat option as a Linux kernel option, as opposed
>>>>>>> to a hypervisor option, should only affect the kernel, and
>>>>>>> if the hypervisor provides the pat feature, then the kernel
>>>>>>> should not override that,
>>>>>> Hmm, why would the kernel not be allowed to override that? Such
>>>>>> an override would affect only the single domain where the
>>>>>> kernel runs; other domains could take their own decisions.
>>>>>>
>>>>>> Also, for the sake of completeness: "nopat" used when running on
>>>>>> bare metal has the same bad effect on system boot, so there
>>>>>> pretty clearly is an error cleanup issue in the i915 driver. But
>>>>>> that's orthogonal, and I expect the maintainers may not even care
>>>>>> (but tell us "don't do that then").
>>>> Actually I just did a test with the last official Debian kernel
>>>> build of Linux 5.16, that is, a kernel before bdd8b6c98239 was
>>>> applied. In fact, the nopat option does *not* break the i915 driver
>>>> in 5.16. That is, with the nopat option, the i915 driver loads
>>>> normally on both the bare metal and on the Xen hypervisor.
>>>> That means your presumption (and the presumption of
>>>> the author of bdd8b6c98239) that the "nopat" option was
>>>> being observed by the i915 driver is incorrect. Setting "nopat"
>>>> had no effect on my system with Linux 5.16. So after doing these
>>>> tests, I am against the aggressive approach of breaking the i915
>>>> driver with the "nopat" option because prior to bdd8b6c98239,
>>>> nopat did not break the i915 driver. Why break it now?
>>> Because that's, in my understanding, is the purpose of "nopat"
>>> (not breaking the driver of course - that's a driver bug -, but
>>> having an effect on the driver).
>> I wouldn't call it a driver bug, but an incorrect configuration of the
>> kernel by the user. I presume X86_FEATURE_PAT is required by the
>> i915 driver
> The driver ought to work fine without PAT (and hence without being
> able to make WC mappings). It would use UC instead and be slow, but
> it ought to work.
I am not an expert, but I think the reason it failed on my box was
because of the requirements of CI. Maybe the driver would fall back
to UC if the add_taint_for_CI function did not halt the entire system
in response to the failed test for PAT when trying to use WC mappings.
>> and therefore the driver should refuse to disable
>> it if the user requests to disable it and instead warn the user that
>> the driver did not disable the feature, contrary to what the user
>> requested with the nopat option.
>>
>> In any case, my test did not verify that when nopat is set in Linux 5.16,
>> the thread takes the same code path as when nopat is not set,
>> so I am not totally sure that the reason nopat does not break the
>> i915 driver in 5.16 is that static_cpu_has(X86_FEATURE_PAT)
>> returns true even when nopat is set. I could test it with a custom
>> log message in 5.16 if that is necessary.
>>
>> Are you saying it was wrong for
>> to return true in 5.16 when the user requests nopat?
> No, I'm not saying that. It was wrong for this construct to be used
> in the driver, which was fixed for 5.17 (and which had caused the
> regression I did observe, leading to the patch as a hopefully least
> bad option).
Hmm, the patch I used to fix my box with 5.17.6 used
static_cpu_has(X86_FEATURE_PAT) so the driver could
continue to configure the hardware using WC. This is the
relevant part of the patch I used to fix my box, which includes
extra error logs, (against Debian's official build of 5.17.6):
--- a/drivers/gpu/drm/i915/gem/i915_gem_pages.c 2022-05-09
03:16:33.000000000 -0400
+++ b/drivers/gpu/drm/i915/gem/i915_gem_pages.c 2022-05-19
15:55:40.339778818 -0400
...
@@ -430,17 +434,23 @@
err = i915_gem_object_wait_moving_fence(obj, true);
if (err) {
ptr = ERR_PTR(err);
+ DRM_ERROR("i915_gem_object_wait_moving_fence error, err =
%d\n", err);
goto err_unpin;
}
- if (GEM_WARN_ON(type == I915_MAP_WC && !pat_enabled()))
+ if (GEM_WARN_ON(type == I915_MAP_WC &&
+ !pat_enabled() && !static_cpu_has(X86_FEATURE_PAT))) {
+ DRM_ERROR("type == I915_MAP_WC && !pat_enabled(), err =
%d\n", -ENODEV);
ptr = ERR_PTR(-ENODEV);
+ }
else if (i915_gem_object_has_struct_page(obj))
ptr = i915_gem_object_map_page(obj, type);
else
ptr = i915_gem_object_map_pfn(obj, type);
- if (IS_ERR(ptr))
+ if (IS_ERR(ptr)) {
+ DRM_ERROR("IS_ERR(PTR) is true, returning a (ptr) error\n");
goto err_unpin;
+ }
obj->mm.mapping = page_pack_bits(ptr, type);
}
As you can see, adding the static_cpu_has(X86_FEATURE_PAT)
function to the test for PAT restored the behavior of 5.16 on the
Xen hypervisor to 5.17, and that is how I discovered the solution
to this problem on 5.17 on my box.
>> I think that is
>> just permitting a bad configuration to break the driver that a
>> well-written operating system should not allow. The i915 driver
>> was, in my opinion, correctly ignoring the nopat option in 5.16
>> because that option is not compatible with the hardware the
>> i915 driver is trying to initialize and setup at boot time. At least
>> that is my understanding now, but I will need to test it on 5.16
>> to be sure I understand it correctly.
>>
>> Also, AFAICT, your patch would break the driver when the nopat
>> option is set and only fix the regression introduced by bdd8b6c98239
>> when nopat is not set on my box, so your patch would
>> introduce a regression relative to Linux 5.16 and earlier for the
>> case when nopat is set on my box. I think your point would
>> be that it is not a regression if it is an incorrect user configuration.
> Again no - my view is that there's a separate, pre-existing issue
> in the driver which was uncovered by the change. This may be a
> perceived regression, but is imo different from a real one.
Maybe it is only a perceived regression if nopat is set, but
imo bdd8b6c98239 introduced a real regression in 5.17
relative to 5.16 for the correctly and identically configured
case when the nopat option is not set. That is why I still think
it should be reverted and the fix backported to 5.17 until the
regression for the case when nopat is not set is fixed. As I
said before, the i915 driver relies on the memory subsyste
to provide it with an accurate test for the x86 pat feature.
The test the driver used in bdd8b6c98239 gives the i915 driver
a false negative, and that caused a real regression when nopat
is not set. bdd8b6c98239 can be re-applied if we apply your
patch which corrects the false negative that pat_enabled() is
currently providing the i915 driver with. That false negative
from pat_enabled() is not an i915 bug, it is a bug in x86/pat.
Chuck