Hi,
On Tue, Feb 13, 2018 at 3:32 PM, James Hogan <jhogan(a)kernel.org> wrote:
> On Tue, Feb 13, 2018 at 03:03:24PM +0100, Mathieu Malaterre wrote:
>> James,
>>
>> On Tue, Feb 13, 2018 at 2:38 PM, James Hogan <jhogan(a)kernel.org> wrote:
>> > On Tue, Feb 13, 2018 at 01:14:29PM +0100, Mathieu Malaterre wrote:
>> >> Could you please review the patch v3 ?
>> >
>> > Yes, looks good to me, and Ralf had applied to his test branch so I
>> > presume he's happy with it too. I'll apply for 4.16.
>>
>> Hum, just to be sure I understand the process. Which branch are you
>> talking about:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/ralf/linux.git
>
> I was referring to upstream-sfr.git branch=mips-next-test
> https://git.linux-mips.org/cgit/ralf/upstream-sfr.git/log/?h=mips-next-test
>
> (The mips-next branch there is what Ralf puts into linux-next)
>
> I've applied the patch to my mips-fixes branch here:
> git://git.kernel.org/pub/scm/linux/kernel/git/jhogan/mips.git
>
> Sorry it seems a bit haphazard with multiple trees in use.
I see it now, sorry for the noise. I was not looking at the right location.
Anyway if that answer earlier question 4.11 should be correct, since I
asked Greg to not backport the earlier patch:
https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1505915.html
From: Eric Biggers <ebiggers(a)google.com>
Memslots must not overlap in guest physical memory, since otherwise some
guest physical addresses will not uniquely map to a memslot. Yet, the
overlap check in __kvm_set_memory_region() allows a memslot that
overlaps one of the "private" memslots, e.g. the memslot reserved for
the TSS on x86.
This seems to be a very old bug that was introduced years ago when
private memory slots were first added. It seems that later refactoring
incorrectly assumed this bug was intentional and preserved it.
Fix it by removing the loophole for private memslots, so we just check
for overlap against all memslots.
This bug was found by syzkaller, which used a memslot overlap to make
pte_list_remove() be called for the wrong memslot, hitting a BUG():
pte_list_remove: 000000007185ed42 0->BUG
kernel BUG at arch/x86/kvm/mmu.c:1209!
[...]
RIP: 0010:pte_list_remove+0x107/0x110 arch/x86/kvm/mmu.c:1208
[...]
Call Trace:
mmu_page_zap_pte+0x7e/0xd0 arch/x86/kvm/mmu.c:2499
kvm_mmu_page_unlink_children arch/x86/kvm/mmu.c:2521 [inline]
kvm_mmu_prepare_zap_page+0x4f/0x340 arch/x86/kvm/mmu.c:2565
kvm_zap_obsolete_pages arch/x86/kvm/mmu.c:5348 [inline]
kvm_mmu_invalidate_zap_all_pages+0xa6/0x100 arch/x86/kvm/mmu.c:5389
kvm_mmu_notifier_release+0x4f/0x80 arch/x86/kvm/../../../virt/kvm/kvm_main.c:468
__mmu_notifier_release+0x63/0x100 mm/mmu_notifier.c:75
mmu_notifier_release include/linux/mmu_notifier.h:244 [inline]
exit_mmap+0x160/0x170 mm/mmap.c:3009
__mmput kernel/fork.c:966 [inline]
mmput+0x44/0xd0 kernel/fork.c:987
exit_mm kernel/exit.c:544 [inline]
do_exit+0x24a/0xb50 kernel/exit.c:856
do_group_exit+0x34/0xb0 kernel/exit.c:972
SYSC_exit_group kernel/exit.c:983 [inline]
SyS_exit_group+0xb/0x10 kernel/exit.c:981
entry_SYSCALL_64_fastpath+0x1e/0x8b
Reproducer:
#include <fcntl.h>
#include <linux/kvm.h>
#include <sys/ioctl.h>
int main()
{
static char buf[4096*3] __attribute__((aligned(4096)));
int kvm, vm, cpu;
struct kvm_mp_state mp_state = { KVM_MP_STATE_SIPI_RECEIVED };
struct kvm_userspace_memory_region memreg = {
.memory_size = sizeof(buf),
.userspace_addr = (__u64)buf,
};
kvm = open("/dev/kvm", O_RDWR);
vm = ioctl(kvm, KVM_CREATE_VM, 0);
ioctl(vm, KVM_CREATE_IRQCHIP);
cpu = ioctl(vm, KVM_CREATE_VCPU, 0);
ioctl(cpu, KVM_SET_MP_STATE, &mp_state);
ioctl(vm, KVM_SET_TSS_ADDR, 0);
ioctl(cpu, KVM_RUN, 0);
ioctl(vm, KVM_SET_USER_MEMORY_REGION, &memreg);
}
Reported-by: syzbot <syzkaller(a)googlegroups.com>
Fixes: e0d62c7f4860 ("KVM: Add kernel-internal memory slots")
Cc: <stable(a)vger.kernel.org> # v2.6.25+
Signed-off-by: Eric Biggers <ebiggers(a)google.com>
---
virt/kvm/kvm_main.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 210bf820385a..e536977e7b6d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -974,8 +974,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
/* Check for overlaps */
r = -EEXIST;
kvm_for_each_memslot(slot, __kvm_memslots(kvm, as_id)) {
- if ((slot->id >= KVM_USER_MEM_SLOTS) ||
- (slot->id == id))
+ if (slot->id == id)
continue;
if (!((base_gfn + npages <= slot->base_gfn) ||
(base_gfn >= slot->base_gfn + slot->npages)))
--
2.16.0
James,
On Tue, Feb 13, 2018 at 2:38 PM, James Hogan <jhogan(a)kernel.org> wrote:
> On Tue, Feb 13, 2018 at 01:14:29PM +0100, Mathieu Malaterre wrote:
>> On Thu, Feb 1, 2018 at 1:12 PM, Mathieu Malaterre <malat(a)debian.org> wrote:
>> > On Thu, Feb 1, 2018 at 12:37 PM, Marcin Nowakowski
>> > <marcin.nowakowski(a)mips.com> wrote:
>> >> Commit 73fbc1eba7ff ("MIPS: fix mem=X@Y commandline processing") added a
>> >> fix to ensure that the memory range between PHYS_OFFSET and low memory
>> >> address specified by mem= cmdline argument is not later processed by
>> >> free_all_bootmem. This change was incorrect for systems where the
>> >> commandline specifies more than 1 mem argument, as it will cause all
>> >> memory between PHYS_OFFSET and each of the memory offsets to be marked
>> >> as reserved, which results in parts of the RAM marked as reserved
>> >> (Creator CI20's u-boot has a default commandline argument 'mem=256M@0x0
>> >> mem=768M@0x30000000').
>> >>
>> >> Change the behaviour to ensure that only the range between PHYS_OFFSET
>> >> and the lowest start address of the memories is marked as protected.
>> >>
>> >> This change also ensures that the range is marked protected even if it's
>> >> only defined through the devicetree and not only via commandline
>> >> arguments.
>> >>
>> >> Reported-by: Mathieu Malaterre <mathieu.malaterre(a)gmail.com>
>> >> Signed-off-by: Marcin Nowakowski <marcin.nowakowski(a)mips.com>
>> >> Fixes: 73fbc1eba7ff ("MIPS: fix mem=X@Y commandline processing")
>> >> Cc: <stable(a)vger.kernel.org> # v4.11+
>> >> ---
>> >> v3: Update stable version, code cleanup as suggested by James Hogan
>> >> v2: Use updated email adress, add tag for stable.
>> >> ---
>> >> arch/mips/kernel/setup.c | 16 ++++++++++++----
>> >> 1 file changed, 12 insertions(+), 4 deletions(-)
>> >>
>> >> diff --git a/arch/mips/kernel/setup.c b/arch/mips/kernel/setup.c
>> >> index 702c678de116..e4a1581ce822 100644
>> >> --- a/arch/mips/kernel/setup.c
>> >> +++ b/arch/mips/kernel/setup.c
>> >> @@ -375,6 +375,7 @@ static void __init bootmem_init(void)
>> >> unsigned long reserved_end;
>> >> unsigned long mapstart = ~0UL;
>> >> unsigned long bootmap_size;
>> >> + phys_addr_t ramstart = (phys_addr_t)ULLONG_MAX;
>> >> bool bootmap_valid = false;
>> >> int i;
>> >>
>> >> @@ -395,7 +396,8 @@ static void __init bootmem_init(void)
>> >> max_low_pfn = 0;
>> >>
>> >> /*
>> >> - * Find the highest page frame number we have available.
>> >> + * Find the highest page frame number we have available
>> >> + * and the lowest used RAM address
>> >> */
>> >> for (i = 0; i < boot_mem_map.nr_map; i++) {
>> >> unsigned long start, end;
>> >> @@ -407,6 +409,8 @@ static void __init bootmem_init(void)
>> >> end = PFN_DOWN(boot_mem_map.map[i].addr
>> >> + boot_mem_map.map[i].size);
>> >>
>> >> + ramstart = min(ramstart, boot_mem_map.map[i].addr);
>> >> +
>> >> #ifndef CONFIG_HIGHMEM
>> >> /*
>> >> * Skip highmem here so we get an accurate max_low_pfn if low
>> >> @@ -436,6 +440,13 @@ static void __init bootmem_init(void)
>> >> mapstart = max(reserved_end, start);
>> >> }
>> >>
>> >> + /*
>> >> + * Reserve any memory between the start of RAM and PHYS_OFFSET
>> >> + */
>> >> + if (ramstart > PHYS_OFFSET)
>> >> + add_memory_region(PHYS_OFFSET, ramstart - PHYS_OFFSET,
>> >> + BOOT_MEM_RESERVED);
>> >> +
>> >> if (min_low_pfn >= max_low_pfn)
>> >> panic("Incorrect memory mapping !!!");
>> >> if (min_low_pfn > ARCH_PFN_OFFSET) {
>> >> @@ -664,9 +675,6 @@ static int __init early_parse_mem(char *p)
>> >>
>> >> add_memory_region(start, size, BOOT_MEM_RAM);
>> >>
>> >> - if (start && start > PHYS_OFFSET)
>> >> - add_memory_region(PHYS_OFFSET, start - PHYS_OFFSET,
>> >> - BOOT_MEM_RESERVED);
>> >> return 0;
>> >> }
>> >> early_param("mem", early_parse_mem);
>> >> --
>> >> 2.14.1
>> >>
>> >
>> > Looks good to me:
>> >
>> > $ cat /proc/cpuinfo
>> > system type : JZ4780
>> > machine : Creator CI20
>> > processor : 0
>> > cpu model : Ingenic JZRISC V4.15 FPU V0.0
>> > BogoMIPS : 956.00
>> > wait instruction : yes
>> > microsecond timers : no
>> > tlb_entries : 32
>> > extra interrupt vector : yes
>> > hardware watchpoint : yes, count: 1, address/irw mask: [0x0fff]
>> > isa : mips1 mips2 mips32r1 mips32r2
>> > ASEs implemented :
>> > shadow register sets : 1
>> > kscratch registers : 0
>> > package : 0
>> > core : 0
>> > VCED exceptions : not available
>> > VCEI exceptions : not available
>> > $ uname -a
>> > Linux ci20 4.15.0+ #323 PREEMPT Thu Feb 1 13:08:11 CET 2018 mips GNU/Linux
>> >
>> > Tested-by: Mathieu Malaterre <malat(a)debian.org>
>> >
>> > Thanks
>>
>> Could you please review the patch v3 ?
>
> Yes, looks good to me, and Ralf had applied to his test branch so I
> presume he's happy with it too. I'll apply for 4.16.
Hum, just to be sure I understand the process. Which branch are you
talking about:
https://git.kernel.org/pub/scm/linux/kernel/git/ralf/linux.git
> Commit 73fbc1eba7ff ("MIPS: fix mem=X@Y commandline processing") which
> this fixes was evidently requested to be backported to stable (unsure
> who by) and added to the 4.9 queue, but then this arose and it was
> removed until this fix is merged (see
> https://patchwork.linux-mips.org/patch/17268/).
>
> Anybody know how far back before 4.11 both of these patches should be
> applied to stable? If not I'll just leave it at 4.11 and if its
> important before then for kexec or whatever they can be requested again.
Commit 73fbc1eba7ff ("MIPS: fix mem=X@Y commandline processing") added a
fix to ensure that the memory range between PHYS_OFFSET and low memory
address specified by mem= cmdline argument is not later processed by
free_all_bootmem. This change was incorrect for systems where the
commandline specifies more than 1 mem argument, as it will cause all
memory between PHYS_OFFSET and each of the memory offsets to be marked
as reserved, which results in parts of the RAM marked as reserved
(Creator CI20's u-boot has a default commandline argument 'mem=256M@0x0
mem=768M@0x30000000').
Change the behaviour to ensure that only the range between PHYS_OFFSET
and the lowest start address of the memories is marked as protected.
This change also ensures that the range is marked protected even if it's
only defined through the devicetree and not only via commandline
arguments.
Reported-by: Mathieu Malaterre <mathieu.malaterre(a)gmail.com>
Signed-off-by: Marcin Nowakowski <marcin.nowakowski(a)mips.com>
Fixes: 73fbc1eba7ff ("MIPS: fix mem=X@Y commandline processing")
Cc: <stable(a)vger.kernel.org> # v4.11+
---
v3: Update stable version, code cleanup as suggested by James Hogan
v2: Use updated email adress, add tag for stable.
---
arch/mips/kernel/setup.c | 16 ++++++++++++----
1 file changed, 12 insertions(+), 4 deletions(-)
diff --git a/arch/mips/kernel/setup.c b/arch/mips/kernel/setup.c
index 702c678de116..e4a1581ce822 100644
--- a/arch/mips/kernel/setup.c
+++ b/arch/mips/kernel/setup.c
@@ -375,6 +375,7 @@ static void __init bootmem_init(void)
unsigned long reserved_end;
unsigned long mapstart = ~0UL;
unsigned long bootmap_size;
+ phys_addr_t ramstart = (phys_addr_t)ULLONG_MAX;
bool bootmap_valid = false;
int i;
@@ -395,7 +396,8 @@ static void __init bootmem_init(void)
max_low_pfn = 0;
/*
- * Find the highest page frame number we have available.
+ * Find the highest page frame number we have available
+ * and the lowest used RAM address
*/
for (i = 0; i < boot_mem_map.nr_map; i++) {
unsigned long start, end;
@@ -407,6 +409,8 @@ static void __init bootmem_init(void)
end = PFN_DOWN(boot_mem_map.map[i].addr
+ boot_mem_map.map[i].size);
+ ramstart = min(ramstart, boot_mem_map.map[i].addr);
+
#ifndef CONFIG_HIGHMEM
/*
* Skip highmem here so we get an accurate max_low_pfn if low
@@ -436,6 +440,13 @@ static void __init bootmem_init(void)
mapstart = max(reserved_end, start);
}
+ /*
+ * Reserve any memory between the start of RAM and PHYS_OFFSET
+ */
+ if (ramstart > PHYS_OFFSET)
+ add_memory_region(PHYS_OFFSET, ramstart - PHYS_OFFSET,
+ BOOT_MEM_RESERVED);
+
if (min_low_pfn >= max_low_pfn)
panic("Incorrect memory mapping !!!");
if (min_low_pfn > ARCH_PFN_OFFSET) {
@@ -664,9 +675,6 @@ static int __init early_parse_mem(char *p)
add_memory_region(start, size, BOOT_MEM_RAM);
- if (start && start > PHYS_OFFSET)
- add_memory_region(PHYS_OFFSET, start - PHYS_OFFSET,
- BOOT_MEM_RESERVED);
return 0;
}
early_param("mem", early_parse_mem);
--
2.14.1
Commit-ID: 295cc7eb314eb3321fb6d67ca6f7305f5c50d10f
Gitweb: https://git.kernel.org/tip/295cc7eb314eb3321fb6d67ca6f7305f5c50d10f
Author: Masayoshi Mizuma <m.mizuma(a)jp.fujitsu.com>
AuthorDate: Thu, 8 Feb 2018 09:19:08 -0500
Committer: Ingo Molnar <mingo(a)kernel.org>
CommitDate: Tue, 13 Feb 2018 12:47:28 +0100
x86/smpboot: Fix uncore_pci_remove() indexing bug when hot-removing a physical CPU
When a physical CPU is hot-removed, the following warning messages
are shown while the uncore device is removed in uncore_pci_remove():
WARNING: CPU: 120 PID: 5 at arch/x86/events/intel/uncore.c:988
uncore_pci_remove+0xf1/0x110
...
CPU: 120 PID: 5 Comm: kworker/u1024:0 Not tainted 4.15.0-rc8 #1
Workqueue: kacpi_hotplug acpi_hotplug_work_fn
...
Call Trace:
pci_device_remove+0x36/0xb0
device_release_driver_internal+0x145/0x210
pci_stop_bus_device+0x76/0xa0
pci_stop_root_bus+0x44/0x60
acpi_pci_root_remove+0x1f/0x80
acpi_bus_trim+0x54/0x90
acpi_bus_trim+0x2e/0x90
acpi_device_hotplug+0x2bc/0x4b0
acpi_hotplug_work_fn+0x1a/0x30
process_one_work+0x141/0x340
worker_thread+0x47/0x3e0
kthread+0xf5/0x130
When uncore_pci_remove() runs, it tries to get the package ID to
clear the value of uncore_extra_pci_dev[].dev[] by using
topology_phys_to_logical_pkg(). The warning messesages are
shown because topology_phys_to_logical_pkg() returns -1.
arch/x86/events/intel/uncore.c:
static void uncore_pci_remove(struct pci_dev *pdev)
{
...
phys_id = uncore_pcibus_to_physid(pdev->bus);
...
pkg = topology_phys_to_logical_pkg(phys_id); // returns -1
for (i = 0; i < UNCORE_EXTRA_PCI_DEV_MAX; i++) {
if (uncore_extra_pci_dev[pkg].dev[i] == pdev) {
uncore_extra_pci_dev[pkg].dev[i] = NULL;
break;
}
}
WARN_ON_ONCE(i >= UNCORE_EXTRA_PCI_DEV_MAX); // <=========== HERE!!
topology_phys_to_logical_pkg() tries to find
cpuinfo_x86->phys_proc_id that matches the phys_pkg argument.
arch/x86/kernel/smpboot.c:
int topology_phys_to_logical_pkg(unsigned int phys_pkg)
{
int cpu;
for_each_possible_cpu(cpu) {
struct cpuinfo_x86 *c = &cpu_data(cpu);
if (c->initialized && c->phys_proc_id == phys_pkg)
return c->logical_proc_id;
}
return -1;
}
However, the phys_proc_id was already set to 0 by remove_siblinginfo()
when the CPU was offlined.
So, topology_phys_to_logical_pkg() cannot find the correct
logical_proc_id and always returns -1.
As the result, uncore_pci_remove() calls WARN_ON_ONCE() and the warning
messages are shown.
What is worse is that the bogus 'pkg' index results in two bugs:
- We dereference uncore_extra_pci_dev[] with a negative index
- We fail to clean up a stale pointer in uncore_extra_pci_dev[][]
To fix these bugs, remove the clearing of ->phys_proc_id from remove_siblinginfo().
This should not cause any problems, because ->phys_proc_id is not
used after it is hot-removed and it is re-set while hot-adding.
Signed-off-by: Masayoshi Mizuma <m.mizuma(a)jp.fujitsu.com>
Acked-by: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: yasu.isimatu(a)gmail.com
Cc: <stable(a)vger.kernel.org>
Fixes: 30bb9811856f ("x86/topology: Avoid wasting 128k for package id array")
Link: http://lkml.kernel.org/r/ed738d54-0f01-b38b-b794-c31dc118c207@gmail.com
Signed-off-by: Ingo Molnar <mingo(a)kernel.org>
---
arch/x86/kernel/smpboot.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 6f27fac..cfc61e1 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1430,7 +1430,6 @@ static void remove_siblinginfo(int cpu)
cpumask_clear(cpu_llc_shared_mask(cpu));
cpumask_clear(topology_sibling_cpumask(cpu));
cpumask_clear(topology_core_cpumask(cpu));
- c->phys_proc_id = 0;
c->cpu_core_id = 0;
cpumask_clear_cpu(cpu, cpu_sibling_setup_mask);
recompute_smt_state();
This is a note to let you know that I've just added the patch titled
watchdog: indydog: Add dependency on SGI_HAS_INDYDOG
to the 4.14-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
watchdog-indydog-add-dependency-on-sgi_has_indydog.patch
and it can be found in the queue-4.14 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 24f8d233074badd4c18e4dafd2fb97d65838afed Mon Sep 17 00:00:00 2001
From: Matt Redfearn <matt.redfearn(a)mips.com>
Date: Tue, 14 Nov 2017 10:52:54 +0000
Subject: watchdog: indydog: Add dependency on SGI_HAS_INDYDOG
From: Matt Redfearn <matt.redfearn(a)mips.com>
commit 24f8d233074badd4c18e4dafd2fb97d65838afed upstream.
Commit da2a68b3eb47 ("watchdog: Enable COMPILE_TEST where possible")
enabled building the Indy watchdog driver when COMPILE_TEST is enabled.
However, the driver makes reference to symbols that are only defined for
certain platforms are selected in the config. These platforms select
SGI_HAS_INDYDOG. Without this, link time errors result, for example
when building a MIPS allyesconfig.
drivers/watchdog/indydog.o: In function `indydog_write':
indydog.c:(.text+0x18): undefined reference to `sgimc'
indydog.c:(.text+0x1c): undefined reference to `sgimc'
drivers/watchdog/indydog.o: In function `indydog_start':
indydog.c:(.text+0x54): undefined reference to `sgimc'
indydog.c:(.text+0x58): undefined reference to `sgimc'
drivers/watchdog/indydog.o: In function `indydog_stop':
indydog.c:(.text+0xa4): undefined reference to `sgimc'
drivers/watchdog/indydog.o:indydog.c:(.text+0xa8): more undefined
references to `sgimc' follow
make: *** [Makefile:1005: vmlinux] Error 1
Fix this by ensuring that CONFIG_INDIDOG can only be selected when the
necessary dependent platform symbols are built in.
Fixes: da2a68b3eb47 ("watchdog: Enable COMPILE_TEST where possible")
Signed-off-by: Matt Redfearn <matt.redfearn(a)mips.com>
Signed-off-by: Ralf Baechle <ralf(a)linux-mips.org>
Suggested-by: James Hogan <james.hogan(a)mips.com>
Reviewed-by: Guenter Roeck <linux(a)roeck-us.net>
Signed-off-by: Guenter Roeck <linux(a)roeck-us.net>
Signed-off-by: Wim Van Sebroeck <wim(a)iguana.be>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/watchdog/Kconfig | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/drivers/watchdog/Kconfig
+++ b/drivers/watchdog/Kconfig
@@ -1451,7 +1451,7 @@ config RC32434_WDT
config INDYDOG
tristate "Indy/I2 Hardware Watchdog"
- depends on SGI_HAS_INDYDOG || (MIPS && COMPILE_TEST)
+ depends on SGI_HAS_INDYDOG
help
Hardware driver for the Indy's/I2's watchdog. This is a
watchdog timer that will reboot the machine after a 60 second
Patches currently in stable-queue which might be from matt.redfearn(a)mips.com are
queue-4.14/watchdog-indydog-add-dependency-on-sgi_has_indydog.patch
This is a regression fix.
Before, to fix erratum AVR31, commit 522e66464467 ("x86/apic: Disable
I/O APIC before shutdown of the local APIC") moved lapic_shutdown()
calling after disable_IO_APIC() in reboot and kexec/kdump code path.
This introdued a regression. The root cause is that disable_IO_APIC()
not only clears IO_APIC, also restore boot irq mode by setting
LAPIC/APIC/IMCR, calling lapic_shutdown() after disable_IO_APIC() will
disable LAPIC and ruin the possible virtual wire mode setting which
the code has been trying to do all along.
The consequence is, in KVM guest kernel always prints warning as below
during kexec/kdump kernel boots up. That happened in setup_local_APIC()
since 'do { xxx } while (queued && max_loops > 0)' loop does not function
well any more if pending irq exists in APIC IRR after LAPIC is disabled.
[ 0.001000] WARNING: CPU: 0 PID: 0 at arch/x86/kernel/apic/apic.c:1467 setup_local_APIC+0x228/0x330
[ 0.001000] Modules linked in:
[ 0.001000] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.15.0-rc5+ #3
[ 0.001000] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc26 04/01/2014
[ 0.001000] RIP: 0010:setup_local_APIC+0x228/0x330
[ 0.001000] RSP: 0000:ffffffffb6e03eb8 EFLAGS: 00010286
[ 0.001000] RAX: 0000009edb4c4d84 RBX: 0000000000000000 RCX: 00000000b099d800
[ 0.001000] RDX: 0000009e00000000 RSI: 0000000000000000 RDI: 0000000000000810
[ 0.001000] RBP: 0000000000000000 R08: ffffffffffffffff R09: 0000000000000001
[ 0.001000] R10: ffff98ce6a801c00 R11: 0761076d072f0776 R12: 0000000000000001
[ 0.001000] R13: 00000000000000f0 R14: 0000000000004000 R15: ffffffffffffc6ff
[ 0.001000] FS: 0000000000000000(0000) GS:ffff98ce6bc00000(0000) knlGS:0000000000000000
[ 0.001000] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 0.001000] CR2: 00000000ffffffff CR3: 0000000022209000 CR4: 00000000000406b0
[ 0.001000] Call Trace:
[ 0.001000] apic_bsp_setup+0x56/0x74
[ 0.001000] x86_late_time_init+0x11/0x16
[ 0.001000] start_kernel+0x3c9/0x486
[ 0.001000] secondary_startup_64+0xa5/0xb0
[ 0.001000] Code: 00 85 c9 74 2d 0f 31 c1 e1 0a 48 c1 e2 20 41 89 cf 4c 03 7c 24 08 48 09 d0 49 29 c7 4c 89 3c 24 48 83 3c 24 00 0f 8f 8f fe ff
ff <0f> ff e9 10 ff ff ff 48 83 2c 24 01 eb e7 48 83 c4 18 5b 5d 41
[ 0.001000] ---[ end trace b88e71b9a6ebebdd ]---
[ 0.001000] masked ExtINT on CPU#0
To fix this, just break down disable_IO_APIC(), then call
clear_IO_APIC() to stop IO_APIC where disable_IO_APIC() was called,
and call restore_boot_irq_mode() to restore boot irq mode before
reboot or kexec/kdump jump.
As for KEXEC_JUMP, it's a little different then reboot and kexec/kdump.
It doesn't call lapic_shutdown() before jump, so is not impacted by
commit 522e66464467. Here in order to keep it the same as the old code,
replace the old disable_IO_APIC() with clear_IO_APIC() and
restore_boot_irq_mode().
Signed-off-by: Baoquan He <bhe(a)redhat.com>
Fixes: commit 522e66464467 ("x86/apic: Disable I/O APIC before shutdown of the local APIC")
Cc: stable(a)vger.kernel.org
---
v4->v3:
Eric pointed out the change related to KEXEC_JUMP is not right.
Correct it.
Add Fixes tag and Cc to stable.
arch/x86/include/asm/io_apic.h | 1 +
arch/x86/kernel/apic/io_apic.c | 2 +-
arch/x86/kernel/crash.c | 3 ++-
arch/x86/kernel/machine_kexec_32.c | 3 ++-
arch/x86/kernel/machine_kexec_64.c | 3 ++-
arch/x86/kernel/reboot.c | 3 ++-
6 files changed, 10 insertions(+), 5 deletions(-)
diff --git a/arch/x86/include/asm/io_apic.h b/arch/x86/include/asm/io_apic.h
index 558d1a6a13ad..0fa95bfacb39 100644
--- a/arch/x86/include/asm/io_apic.h
+++ b/arch/x86/include/asm/io_apic.h
@@ -193,6 +193,7 @@ static inline unsigned int io_apic_read(unsigned int apic, unsigned int reg)
extern void setup_IO_APIC(void);
extern void enable_IO_APIC(void);
extern void disable_IO_APIC(void);
+extern void clear_IO_APIC(void);
extern void restore_boot_irq_mode(void);
extern int IO_APIC_get_PCI_irq_vector(int bus, int devfn, int pin);
extern void print_IO_APICs(void);
diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index 7b73b6b9b4b6..2d7cd2db77f5 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -587,7 +587,7 @@ static void clear_IO_APIC_pin(unsigned int apic, unsigned int pin)
mpc_ioapic_id(apic), pin);
}
-static void clear_IO_APIC (void)
+void clear_IO_APIC (void)
{
int apic, pin;
diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index 10e74d4778a1..1f6680427ff0 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -199,9 +199,10 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
#ifdef CONFIG_X86_IO_APIC
/* Prevent crash_kexec() from deadlocking on ioapic_lock. */
ioapic_zap_locks();
- disable_IO_APIC();
+ clear_IO_APIC();
#endif
lapic_shutdown();
+ restore_boot_irq_mode();
#ifdef CONFIG_HPET_TIMER
hpet_disable();
#endif
diff --git a/arch/x86/kernel/machine_kexec_32.c b/arch/x86/kernel/machine_kexec_32.c
index edfede768688..4cd79d88a4ac 100644
--- a/arch/x86/kernel/machine_kexec_32.c
+++ b/arch/x86/kernel/machine_kexec_32.c
@@ -199,7 +199,8 @@ void machine_kexec(struct kimage *image)
* one form or other. kexec jump path also need
* one.
*/
- disable_IO_APIC();
+ clear_IO_APIC();
+ restore_boot_irq_mode();
#endif
}
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 1f790cf9d38f..2ab14b9c1a89 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -297,7 +297,8 @@ void machine_kexec(struct kimage *image)
* one form or other. kexec jump path also need
* one.
*/
- disable_IO_APIC();
+ clear_IO_APIC();
+ restore_boot_irq_mode();
#endif
}
diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
index 2126b9d27c34..725624b6c0c0 100644
--- a/arch/x86/kernel/reboot.c
+++ b/arch/x86/kernel/reboot.c
@@ -666,7 +666,7 @@ void native_machine_shutdown(void)
* Even without the erratum, it still makes sense to quiet IO APIC
* before disabling Local APIC.
*/
- disable_IO_APIC();
+ clear_IO_APIC();
#endif
#ifdef CONFIG_SMP
@@ -680,6 +680,7 @@ void native_machine_shutdown(void)
#endif
lapic_shutdown();
+ restore_boot_irq_mode();
#ifdef CONFIG_HPET_TIMER
hpet_disable();
--
2.13.6