On Mon, Oct 04, 2021 at 01:08:38AM -0500, Tyler Hicks wrote:
From: Rob Herring <robh@kernel.org>
commit 9885440b16b8fc1dd7275800fd28f56a92f60896 upstream.
The PCI code has several paths where the struct pci_host_bridge is freed directly. This is wrong because it contains a struct device which is refcounted and should be freed using put_device(). This can result in use-after-free errors. I think this problem has existed since 2012 with commit 7b5436635800 ("PCI: add generic device into pci_host_bridge struct"). It generally hasn't mattered as most host bridge drivers are still built-in and can't unbind.
The problem is a struct device should never be freed directly once device_initialize() is called and a ref is held, but that doesn't happen until pci_register_host_bridge(). There's then a window between allocating the host bridge and pci_register_host_bridge() where kfree should be used. This is fragile and requires callers to do the right thing. To fix this, we need to split device_register() into device_initialize() and device_add() calls, so that the host bridge struct is always freed by using a put_device().
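The ownership rule can be made concrete with a minimal userspace C sketch. It is a model built on assumptions, not kernel code. model_device, model_host_bridge, model_device_initialize(), model_get_device() and model_put_device() are hypothetical stand-ins for struct device, struct pci_host_bridge, device_initialize(), get_device() and put_device(). The later device_add() step is not modeled. The allocator initializes the embedded device right away, so every teardown path can drop a reference. The memory is then freed only by the release callback, once the last holder is gone.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace model of a refcounted struct device.
 * Not the kernel's driver-core implementation. */
struct model_device {
	int refcount;
	void (*release)(struct model_device *dev);
};

/* Hypothetical stand-in for struct pci_host_bridge; the device is
 * embedded, so releasing the device must free the whole bridge. */
struct model_host_bridge {
	int busnr;
	struct model_device dev;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static int bridges_released;	/* counts calls to the release callback */

static void model_device_initialize(struct model_device *dev,
				    void (*release)(struct model_device *))
{
	dev->refcount = 1;	/* the caller now owns one reference */
	dev->release = release;
}

static struct model_device *model_get_device(struct model_device *dev)
{
	dev->refcount++;
	return dev;
}

static void model_put_device(struct model_device *dev)
{
	assert(dev->refcount > 0);
	if (--dev->refcount == 0)
		dev->release(dev);	/* last reference: free the container */
}

static void model_bridge_release(struct model_device *dev)
{
	bridges_released++;
	free(container_of(dev, struct model_host_bridge, dev));
}

/* Initialize the embedded device immediately after allocation, so there
 * is no window in which callers must use free() instead of a put. */
static struct model_host_bridge *model_alloc_host_bridge(void)
{
	struct model_host_bridge *bridge = calloc(1, sizeof(*bridge));

	if (!bridge)
		return NULL;
	model_device_initialize(&bridge->dev, model_bridge_release);
	return bridge;
}

/* Error and teardown paths drop a reference instead of calling free(). */
static void model_free_host_bridge(struct model_host_bridge *bridge)
{
	model_put_device(&bridge->dev);
}
```

In this model, a second holder that takes a reference with model_get_device() keeps the bridge alive after the owner calls model_free_host_bridge(). A direct free() at that point would leave the other holder with a dangling pointer, which is the class of use-after-free described in the commit message.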
devm_pci_alloc_host_bridge() is using devm_kzalloc() to allocate struct pci_host_bridge which will be freed directly. Instead, we can use a custom devres action to call put_device().
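The devres change can be modeled in the same way. The sketch below is again a hypothetical userspace model, not the kernel's devres code. model_parent records cleanup actions that run in reverse order when the parent is unbound. model_add_action_or_reset() is loosely patterned on the kernel's devm_add_action_or_reset() helper, although the message above only says that a custom devres action is used, not which helper. The registered action drops a reference rather than freeing the bridge memory directly.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical model of devres: actions registered against a parent run
 * in reverse registration order when the parent is unbound. */
#define MAX_ACTIONS 8

struct model_parent {
	int nr_actions;
	void (*action[MAX_ACTIONS])(void *data);
	void *data[MAX_ACTIONS];
};

/* Refcounted object standing in for struct pci_host_bridge. */
struct model_bridge {
	int refcount;
};

static int bridges_freed;	/* counts actual frees of bridge memory */

static void model_bridge_put(struct model_bridge *bridge)
{
	assert(bridge->refcount > 0);
	if (--bridge->refcount == 0) {
		bridges_freed++;
		free(bridge);
	}
}

/* If the action cannot be recorded, run it immediately so the
 * resource is not leaked (the "or_reset" behavior). */
static int model_add_action_or_reset(struct model_parent *parent,
				     void (*action)(void *), void *data)
{
	if (parent->nr_actions == MAX_ACTIONS) {
		action(data);
		return -1;
	}
	parent->action[parent->nr_actions] = action;
	parent->data[parent->nr_actions] = data;
	parent->nr_actions++;
	return 0;
}

/* Simulates the parent being unbound: run actions newest-first. */
static void model_release_actions(struct model_parent *parent)
{
	while (parent->nr_actions > 0) {
		parent->nr_actions--;
		parent->action[parent->nr_actions](parent->data[parent->nr_actions]);
	}
}

/* The devres action drops a reference instead of freeing directly. */
static void model_bridge_put_action(void *data)
{
	model_bridge_put(data);
}

static struct model_bridge *model_devm_alloc_bridge(struct model_parent *parent)
{
	struct model_bridge *bridge = calloc(1, sizeof(*bridge));

	if (!bridge)
		return NULL;
	bridge->refcount = 1;
	/* On failure the action has already dropped the only reference. */
	if (model_add_action_or_reset(parent, model_bridge_put_action, bridge))
		return NULL;
	return bridge;
}
```

If another user still holds a reference when the parent unbinds, the bridge stays allocated until that user's final put. A devm_kzalloc()-style direct free cannot provide that guarantee.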
Link: https://lore.kernel.org/r/20200513223859.11295-2-robh@kernel.org
Reported-by: Anders Roxell <anders.roxell@linaro.org>
Tested-by: Anders Roxell <anders.roxell@linaro.org>
Signed-off-by: Rob Herring <robh@kernel.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
[tyhicks: Minor contextual change in pci_init_host_bridge() due to the lack of a native_dpc member in the pci_host_bridge struct. It was added in v5.7 with commit ac1c8e35a326 ("PCI/DPC: Add Error Disconnect Recover (EDR) support")]
Signed-off-by: Tyler Hicks <tyhicks@linux.microsoft.com>
This commit has been identified as a fix for random memory corruption that we're experiencing in production. The memory corruption is easily reproducible on 5.4.150 and we get a nice KASAN splat that led us to discovering the upstream fix that wasn't marked for stable inclusion. I don't see any obvious reasons why this wouldn't be a valid linux-5.4.y candidate and hope we can get it applied there.
I've verified that the KASAN splat goes away and I don't see any other evidence of the memory corruption issue once this commit is applied to 5.4.150.
Now queued up, thanks.
greg k-h