On 4/24/25 5:55 PM, Jack Vogel wrote:
>
>
>> On Apr 24, 2025, at 16:15, Dave Jiang <dave.jiang(a)intel.com> wrote:
>>
>>
>>
>> On 4/24/25 3:59 PM, Jack Vogel wrote:
>>>
>>>
>>>> On Apr 24, 2025, at 15:40, Dave Jiang <dave.jiang(a)intel.com> wrote:
>>>>
>>>>
>>>>
>>>> On 4/24/25 3:34 PM, Jack Vogel wrote:
>>>>> I am having test issues with this patch. The test system is running OL9 (basically RHEL 9.5); the kernel boots OK and the dmesg is clean… but the tests in accel-config don't pass. Specifically the crypto tests fail, because vfio_pci_core does not load. Right now I'm not sure whether any of this is my mistake, but at least it's something I need to keep looking at.
>>>>>
>>>>> Also, since I saw that issue on the latest, I did a backport to our UEK8 kernel, which is 6.12.23, and on that kernel it still exhibited these messages on boot:
>>>>>
>>>>> idxd 0000:6a:01.0: enabling device (0144 -> 0146)
>>>>>
>>>>> [ 21.112733] idxd 0000:6a:01.0: failed to attach device pasid 1, domain type 4
>>>>>
>>>>> [ 21.120770] idxd 0000:6a:01.0: No in-kernel DMA with PASID. -95
>>>>>
>>>>>
>>>>> Again, maybe an issue in my backporting… however I’d like to be sure.
>>>>
>>>> Can you verify against latest upstream kernel plus the patch and see if you still see the error?
>>>>
>>>> DJ
>>>
>>> Yes, the kernel was built from the tip this morning. Like I said, it got no messages booting up; all looked fine. But when running the actual test suite in the accel-config tarball, specifically the IAA crypto tests, they failed, and the dmesg showed that vfio_pci_core failed to load with an unknown symbol.
>>
>> I'm not sure what the test consists of (haven't worked on this device for almost 2 years). But usually the device is either bound to the idxd driver or the vfio_pci driver. Not both. And if the idxd driver didn't emit any errors while loading, then the test failure may be something else...
>>
>> Another way to verify is to set CONFIG_IOMMU_DEFAULT_DMA_LAZY vs PASSTHROUGH. If the tests still fail then it's something else.
>>
>> DJ
>
> There aren't a lot of ways to test this driver; yes, DPDK will use it, but apart from that? So the tests that are part of your (Intel) accel-config package are the only convenient way that I've found to do so. It is also convenient that there is a "make check" target in the top Makefile that will invoke both sets of DMA tests and some crypto (IAA) tests. I have been planning to give this to our QA group as a verification suite. Do you have an alternative to this?
This should be the right test package. Let me talk to our QA people and see if there are any issues. We can resolve this off list. If any issues end up pointing back to the original bug, we can raise that then.
DJ
>
> Jack
>
>>
>>>
>>> This sounds like the module was wrong, but I would think it was installed with the v6.15 kernel…
>>>
>>> Jack
>>>
>>>>
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Jack
>>>>>
>>>>>
>>>>>> On Apr 23, 2025, at 20:41, Lu Baolu <baolu.lu(a)linux.intel.com> wrote:
>>>>>>
>>>>>> The idxd driver attaches the default domain to a PASID of the device to
>>>>>> perform kernel DMA using that PASID. The domain is attached to the
>>>>>> device's PASID through iommu_attach_device_pasid(), which checks if the
>>>>>> domain->owner matches the iommu_ops retrieved from the device. If they
>>>>>> do not match, it returns a failure.
>>>>>>
>>>>>> if (ops != domain->owner || pasid == IOMMU_NO_PASID)
>>>>>> return -EINVAL;
>>>>>>
>>>>>> The static identity domain implemented by the intel iommu driver doesn't
>>>>>> specify the domain owner. Therefore, kernel DMA with PASID doesn't work
>>>>>> for the idxd driver if the device translation mode is set to passthrough.
>>>>>>
>>>>>> Generally the owner field of static domains is not set because they are
>>>>>> already part of the iommu ops. Add a helper domain_iommu_ops_compatible()
>>>>>> that checks if a domain is compatible with the device's iommu ops. This
>>>>>> helper explicitly allows the static blocked and identity domains associated
>>>>>> with the device's iommu_ops to be considered compatible.
>>>>>>
>>>>>> Fixes: 2031c469f816 ("iommu/vt-d: Add support for static identity domain")
>>>>>> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220031
>>>>>> Cc: stable(a)vger.kernel.org
>>>>>> Suggested-by: Jason Gunthorpe <jgg(a)nvidia.com>
>>>>>> Link: https://lore.kernel.org/linux-iommu/20250422191554.GC1213339@ziepe.ca/
>>>>>> Signed-off-by: Lu Baolu <baolu.lu(a)linux.intel.com>
>>>>>> Reviewed-by: Dave Jiang <dave.jiang(a)intel.com>
>>>>>> Reviewed-by: Robin Murphy <robin.murphy(a)arm.com>
>>>>>> ---
>>>>>> drivers/iommu/iommu.c | 21 ++++++++++++++++++---
>>>>>> 1 file changed, 18 insertions(+), 3 deletions(-)
>>>>>>
>>>>>> Change log:
>>>>>> v3:
>>>>>> - Convert all places checking domain->owner to the new helper.
>>>>>> v2: https://lore.kernel.org/linux-iommu/20250423021839.2189204-1-baolu.lu@linux…
>>>>>> - Make the solution generic for all static domains as suggested by
>>>>>> Jason.
>>>>>> v1: https://lore.kernel.org/linux-iommu/20250422075422.2084548-1-baolu.lu@linux…
>>>>>>
>>>>>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>>>>>> index 4f91a740c15f..b26fc3ed9f01 100644
>>>>>> --- a/drivers/iommu/iommu.c
>>>>>> +++ b/drivers/iommu/iommu.c
>>>>>> @@ -2204,6 +2204,19 @@ static void *iommu_make_pasid_array_entry(struct iommu_domain *domain,
>>>>>>  	return xa_tag_pointer(domain, IOMMU_PASID_ARRAY_DOMAIN);
>>>>>>  }
>>>>>>
>>>>>> +static bool domain_iommu_ops_compatible(const struct iommu_ops *ops,
>>>>>> +					 struct iommu_domain *domain)
>>>>>> +{
>>>>>> +	if (domain->owner == ops)
>>>>>> +		return true;
>>>>>> +
>>>>>> +	/* For static domains, owner isn't set. */
>>>>>> +	if (domain == ops->blocked_domain || domain == ops->identity_domain)
>>>>>> +		return true;
>>>>>> +
>>>>>> +	return false;
>>>>>> +}
>>>>>> +
>>>>>>  static int __iommu_attach_group(struct iommu_domain *domain,
>>>>>>  				struct iommu_group *group)
>>>>>>  {
>>>>>> @@ -2214,7 +2227,8 @@ static int __iommu_attach_group(struct iommu_domain *domain,
>>>>>>  		return -EBUSY;
>>>>>>
>>>>>>  	dev = iommu_group_first_dev(group);
>>>>>> -	if (!dev_has_iommu(dev) || dev_iommu_ops(dev) != domain->owner)
>>>>>> +	if (!dev_has_iommu(dev) ||
>>>>>> +	    !domain_iommu_ops_compatible(dev_iommu_ops(dev), domain))
>>>>>>  		return -EINVAL;
>>>>>>
>>>>>>  	return __iommu_group_set_domain(group, domain);
>>>>>> @@ -3435,7 +3449,8 @@ int iommu_attach_device_pasid(struct iommu_domain *domain,
>>>>>>  	    !ops->blocked_domain->ops->set_dev_pasid)
>>>>>>  		return -EOPNOTSUPP;
>>>>>>
>>>>>> -	if (ops != domain->owner || pasid == IOMMU_NO_PASID)
>>>>>> +	if (!domain_iommu_ops_compatible(ops, domain) ||
>>>>>> +	    pasid == IOMMU_NO_PASID)
>>>>>>  		return -EINVAL;
>>>>>>
>>>>>>  	mutex_lock(&group->mutex);
>>>>>> @@ -3511,7 +3526,7 @@ int iommu_replace_device_pasid(struct iommu_domain *domain,
>>>>>>  	if (!domain->ops->set_dev_pasid)
>>>>>>  		return -EOPNOTSUPP;
>>>>>>
>>>>>> -	if (dev_iommu_ops(dev) != domain->owner ||
>>>>>> +	if (!domain_iommu_ops_compatible(dev_iommu_ops(dev), domain) ||
>>>>>>  	    pasid == IOMMU_NO_PASID || !handle)
>>>>>>  		return -EINVAL;
>>>>>>
>>>>>> --
>>>>>> 2.43.0
>
The patch titled
Subject: mm/userfaultfd: fix uninitialized output field for -EAGAIN race
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-userfaultfd-fix-uninitialized-output-field-for-eagain-race.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Peter Xu <peterx(a)redhat.com>
Subject: mm/userfaultfd: fix uninitialized output field for -EAGAIN race
Date: Thu, 24 Apr 2025 17:57:28 -0400
While discussing some userfaultfd-related issues recently, Andrea noticed
a potential ABI breakage with -EAGAIN on almost all userfaultfd ioctl()s.
Quoting Andrea, explaining how -EAGAIN was processed, and how this patch
should fix it (taking the UFFDIO_COPY ioctl as an example):
The "mmap_changing" and "stale pmd" conditions are already reported as
-EAGAIN written in the copy field, this does not change it. This change
removes the subnormal case that left copy.copy uninitialized and required
apps to explicitly set the copy field to get deterministic
behavior (which is a requirement contrary to the documentation in both
the manpage and source code). In turn there's no alteration to backwards
compatibility as result of this change because userland will find the
copy field consistently set to -EAGAIN, and not anymore sometime -EAGAIN
and sometime uninitialized.
Even then the change only can make a difference to non cooperative users
of userfaultfd, so when UFFD_FEATURE_EVENT_* is enabled, which is not
true for the vast majority of apps using userfaultfd or this unintended
uninitialized field may have been noticed sooner.
Meanwhile, since this bug has existed for years, it also affects almost all
ioctl()s that were introduced later. Besides UFFDIO_ZEROPAGE, the
following are affected in the same way:
- UFFDIO_CONTINUE
- UFFDIO_POISON
- UFFDIO_MOVE
This patch should fix all of them.
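For illustration, a minimal userspace sketch of the behavior described above (a hypothetical helper, not part of this patch; it assumes a userfaultfd file descriptor with an already-registered range):

#include <errno.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/*
 * Hypothetical helper: resolve a fault by copying 'len' bytes from 'src'
 * into the registered range at 'dst'.  With this fix, copy.copy is
 * reliably -EAGAIN when the mmap_changing race is hit, so userspace no
 * longer has to pre-initialize the field to get deterministic behavior.
 * A non-cooperative manager would normally drain pending uffd events
 * before retrying.
 */
static long uffd_copy_retry(int uffd, unsigned long dst, unsigned long src,
			    unsigned long len)
{
	struct uffdio_copy copy;

	for (;;) {
		memset(&copy, 0, sizeof(copy));
		copy.dst = dst;
		copy.src = src;
		copy.len = len;
		copy.mode = 0;

		if (ioctl(uffd, UFFDIO_COPY, &copy) == 0)
			return copy.copy;	/* full length copied */
		if (copy.copy == -EAGAIN)
			continue;		/* mmap layout is changing, retry */
		/* partial copy (> 0) or error (< 0) reported in the copy field */
		return copy.copy ? (long)copy.copy : -errno;
	}
}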
Link: https://lkml.kernel.org/r/20250424215729.194656-2-peterx@redhat.com
Fixes: df2cc96e7701 ("userfaultfd: prevent non-cooperative events vs mcopy_atomic races")
Fixes: f619147104c8 ("userfaultfd: add UFFDIO_CONTINUE ioctl")
Fixes: fc71884a5f59 ("mm: userfaultfd: add new UFFDIO_POISON ioctl")
Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
Signed-off-by: Peter Xu <peterx(a)redhat.com>
Reported-by: Andrea Arcangeli <aarcange(a)redhat.com>
Suggested-by: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: Mike Rapoport <rppt(a)kernel.org>
Cc: Axel Rasmussen <axelrasmussen(a)google.com>
Cc: Suren Baghdasaryan <surenb(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/userfaultfd.c | 28 ++++++++++++++++++++++------
1 file changed, 22 insertions(+), 6 deletions(-)
--- a/fs/userfaultfd.c~mm-userfaultfd-fix-uninitialized-output-field-for-eagain-race
+++ a/fs/userfaultfd.c
@@ -1585,8 +1585,11 @@ static int userfaultfd_copy(struct userf
 	user_uffdio_copy = (struct uffdio_copy __user *) arg;
 
 	ret = -EAGAIN;
-	if (atomic_read(&ctx->mmap_changing))
+	if (unlikely(atomic_read(&ctx->mmap_changing))) {
+		if (unlikely(put_user(ret, &user_uffdio_copy->copy)))
+			return -EFAULT;
 		goto out;
+	}
 
 	ret = -EFAULT;
 	if (copy_from_user(&uffdio_copy, user_uffdio_copy,
@@ -1641,8 +1644,11 @@ static int userfaultfd_zeropage(struct u
 	user_uffdio_zeropage = (struct uffdio_zeropage __user *) arg;
 
 	ret = -EAGAIN;
-	if (atomic_read(&ctx->mmap_changing))
+	if (unlikely(atomic_read(&ctx->mmap_changing))) {
+		if (unlikely(put_user(ret, &user_uffdio_zeropage->zeropage)))
+			return -EFAULT;
 		goto out;
+	}
 
 	ret = -EFAULT;
 	if (copy_from_user(&uffdio_zeropage, user_uffdio_zeropage,
@@ -1744,8 +1750,11 @@ static int userfaultfd_continue(struct u
 	user_uffdio_continue = (struct uffdio_continue __user *)arg;
 
 	ret = -EAGAIN;
-	if (atomic_read(&ctx->mmap_changing))
+	if (unlikely(atomic_read(&ctx->mmap_changing))) {
+		if (unlikely(put_user(ret, &user_uffdio_continue->mapped)))
+			return -EFAULT;
 		goto out;
+	}
 
 	ret = -EFAULT;
 	if (copy_from_user(&uffdio_continue, user_uffdio_continue,
@@ -1801,8 +1810,11 @@ static inline int userfaultfd_poison(str
 	user_uffdio_poison = (struct uffdio_poison __user *)arg;
 
 	ret = -EAGAIN;
-	if (atomic_read(&ctx->mmap_changing))
+	if (unlikely(atomic_read(&ctx->mmap_changing))) {
+		if (unlikely(put_user(ret, &user_uffdio_poison->updated)))
+			return -EFAULT;
 		goto out;
+	}
 
 	ret = -EFAULT;
 	if (copy_from_user(&uffdio_poison, user_uffdio_poison,
@@ -1870,8 +1882,12 @@ static int userfaultfd_move(struct userf
 	user_uffdio_move = (struct uffdio_move __user *) arg;
 
-	if (atomic_read(&ctx->mmap_changing))
-		return -EAGAIN;
+	ret = -EAGAIN;
+	if (unlikely(atomic_read(&ctx->mmap_changing))) {
+		if (unlikely(put_user(ret, &user_uffdio_move->move)))
+			return -EFAULT;
+		goto out;
+	}
 
 	if (copy_from_user(&uffdio_move, user_uffdio_move,
			   /* don't copy "move" last field */
_
Patches currently in -mm which might be from peterx(a)redhat.com are
mm-userfaultfd-fix-uninitialized-output-field-for-eagain-race.patch
mm-selftests-add-a-test-to-verify-mmap_changing-race-with-eagain.patch
Ricardo reported a KASAN-discovered use-after-free in v6.6-stable.
The syzbot reproducer starts a BPF program via xdp_test_run_batch() which
assigns ri->tgt_value via dev_hash_map_redirect(). The program's return
code isn't XDP_REDIRECT and looks like nonsense, so the output from
bpf_warn_invalid_xdp_action() appears once.
Then the TUN driver runs another BPF program (on the same CPU) which
returns XDP_REDIRECT without setting ri->tgt_value first. It invokes
bpf_trace_printk() to print four characters and obtain the required
return value. This is enough to get xdp_do_redirect() invoked which
then accesses the pointer in tgt_value which might have been already
deallocated.
This problem does not affect upstream because, since commit
401cb7dae8130 ("net: Reference bpf_redirect_info via task_struct on PREEMPT_RT."),
the per-CPU variable is referenced via the task's task_struct and exists on
the stack during the NAPI callback. Therefore it is cleared once before the
first invocation and remains valid within the RCU section of the NAPI
callback.
Instead of performing the huge backport of that commit (plus its fixups),
here is an alternative version which only resets the variable in
question prior to invoking the BPF program.
Acked-by: Toke Høiland-Jørgensen <toke(a)kernel.org>
Reported-by: Ricardo Cañuelo Navarro <rcn(a)igalia.com>
Closes: https://lore.kernel.org/all/20250226-20250204-kasan-slab-use-after-free-rea…
Fixes: 97f91a7cf04ff ("bpf: add bpf_redirect_map helper routine")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy(a)linutronix.de>
---
I discussed this with Toke; the thread starts at
https://lore.kernel.org/all/20250313183911.SPAmGLyw@linutronix.de/
The commit which fixes this by accident is part of v6.11-rc1.
I added the commit introducing map redirects as the origin of the
problem, which is v4.14-rc1. The code is a bit different there but it
seems to work similarly.
Affected kernels would be from v4.14 to v6.10.
include/net/xdp.h | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/include/net/xdp.h b/include/net/xdp.h
index de08c8e0d1348..b39ac83618a55 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -486,7 +486,14 @@ static __always_inline u32 bpf_prog_run_xdp(const struct bpf_prog *prog,
 	 * under local_bh_disable(), which provides the needed RCU protection
 	 * for accessing map entries.
 	 */
-	u32 act = __bpf_prog_run(prog, xdp, BPF_DISPATCHER_FUNC(xdp));
+	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	u32 act;
+
+	if (ri->map_id || ri->map_type) {
+		ri->map_id = 0;
+		ri->map_type = BPF_MAP_TYPE_UNSPEC;
+	}
+	act = __bpf_prog_run(prog, xdp, BPF_DISPATCHER_FUNC(xdp));
 
 	if (static_branch_unlikely(&bpf_master_redirect_enabled_key)) {
 		if (act == XDP_TX && netif_is_bond_slave(xdp->rxq->dev))
--
2.49.0
The quilt patch titled
Subject: smaps: fix crash in smaps_hugetlb_range for non-present hugetlb entries
has been removed from the -mm tree. Its filename was
smaps-fix-crash-in-smaps_hugetlb_range-for-non-present-hugetlb-entries.patch
This patch was dropped because an alternative patch was or shall be merged
------------------------------------------------------
From: Ming Wang <wangming01(a)loongson.cn>
Subject: smaps: fix crash in smaps_hugetlb_range for non-present hugetlb entries
Date: Wed, 23 Apr 2025 09:03:59 +0800
When reading /proc/pid/smaps for a process that has mapped a hugetlbfs
file with MAP_PRIVATE, the kernel might crash inside
pfn_swap_entry_to_page. This occurs on LoongArch under specific
conditions.
The root cause involves several steps:
1. When the hugetlbfs file is mapped (MAP_PRIVATE), the initial PMD
(or relevant level) entry is often populated by the kernel during
mmap() with a non-present entry pointing to the architecture's
invalid_pte_table. On the affected LoongArch system, this address was
observed to be 0x90000000031e4000.
2. The smaps walker (walk_hugetlb_range -> smaps_hugetlb_range) reads
this entry.
3. The generic is_swap_pte() macro checks `!pte_present() &&
!pte_none()`. The entry (invalid_pte_table address) is not present.
Crucially, the generic pte_none() check (`!(pte_val(pte) &
~_PAGE_GLOBAL)`) returns false because the invalid_pte_table address is
non-zero. Therefore, is_swap_pte() incorrectly returns true.
4. The code enters the `else if (is_swap_pte(...))` block.
5. Inside this block, it checks `is_pfn_swap_entry()`. Due to a bit
pattern coincidence in the invalid_pte_table address on LoongArch, the
embedded generic `is_migration_entry()` check happens to return true
(misinterpreting parts of the address as a migration type).
6. This leads to a call to pfn_swap_entry_to_page() with the bogus
swap entry derived from the invalid table address.
7. pfn_swap_entry_to_page() extracts a meaningless PFN, finds an
unrelated struct page, checks its lock status (unlocked), and hits the
`BUG_ON(is_migration_entry(entry) && !PageLocked(p))` assertion.
The original code's intent in the `else if` block seems aimed at handling
potential migration entries, as indicated by the inner
`is_pfn_swap_entry()` check. The issue arises because the outer
`is_swap_pte()` check incorrectly includes the invalid table pointer case
on LoongArch.
This patch fixes the issue by changing the condition in
smaps_hugetlb_range() from the broad `is_swap_pte()` to the specific
`is_hugetlb_entry_migration()`.
The `is_hugetlb_entry_migration()` helper function correctly handles this
by first checking `huge_pte_none()`. Architectures like LoongArch can
provide an override for `huge_pte_none()` that specifically recognizes the
`invalid_pte_table` address as a "none" state for HugeTLB entries. This
ensures `is_hugetlb_entry_migration()` returns false for the invalid
entry, preventing the code from entering the faulty block.
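As an illustration only, here is a hedged sketch of the kind of per-architecture override described above (this is not claimed to be the actual LoongArch code; invalid_pte_table and _PAGE_GLOBAL are the names already used earlier in this changelog):

/* Hypothetical arch override: also treat a non-present entry that points
 * at invalid_pte_table as "none", so is_hugetlb_entry_migration() bails
 * out instead of decoding it as a migration entry. */
#define __HAVE_ARCH_HUGE_PTE_NONE
static inline int huge_pte_none(pte_t pte)
{
	unsigned long val = pte_val(pte) & ~_PAGE_GLOBAL;

	return !val || (val == (unsigned long)invalid_pte_table);
}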
This change makes the code reflect the likely original intent (handling
migration) more accurately and leverages architecture-specific helpers
(`huge_pte_none`) to correctly interpret special PTE/PMD values in the
HugeTLB context, fixing the crash on LoongArch without altering the
generic is_swap_pte() behavior.
Link: https://lkml.kernel.org/r/20250423010359.2030576-1-wangming01@loongson.cn
Fixes: 25ee01a2fca0 ("mm: hugetlb: proc: add hugetlb-related fields to /proc/PID/smaps")
Co-developed-by: Hongchen Zhang <zhanghongchen(a)loongson.cn>
Signed-off-by: Hongchen Zhang <zhanghongchen(a)loongson.cn>
Signed-off-by: Ming Wang <wangming01(a)loongson.cn>
Cc: Andrii Nakryiko <andrii(a)kernel.org>
Cc: Christophe Leroy <christophe.leroy(a)csgroup.eu>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: David Rientjes <rientjes(a)google.com>
Cc: Huacai Chen <chenhuacai(a)kernel.org>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Joern Engel <joern(a)logfs.org>
Cc: Matthew Wilcox (Oracle) <willy(a)infradead.org>
Cc: Michal Hocko <mhocko(a)suse.cz>
Cc: Naoya Horiguchi <nao.horiguchi(a)gmail.com>
Cc: Oscar Salvador <osalvador(a)suse.de>
Cc: Ryan Roberts <ryan.roberts(a)arm.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/proc/task_mmu.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/fs/proc/task_mmu.c~smaps-fix-crash-in-smaps_hugetlb_range-for-non-present-hugetlb-entries
+++ a/fs/proc/task_mmu.c
@@ -1027,7 +1027,7 @@ static int smaps_hugetlb_range(pte_t *pt
 	if (pte_present(ptent)) {
 		folio = page_folio(pte_page(ptent));
 		present = true;
-	} else if (is_swap_pte(ptent)) {
+	} else if (is_hugetlb_entry_migration(ptent)) {
 		swp_entry_t swpent = pte_to_swp_entry(ptent);
 
 		if (is_pfn_swap_entry(swpent))
_
Patches currently in -mm which might be from wangming01(a)loongson.cn are
When running machines with a 64k page size and a 16k nodesize, we started
seeing tree log corruption in production. This turned out to be because
we were sometimes not writing out dirty blocks, so this in fact affects
all metadata writes.
When writing out a subpage EB, we scan the subpage bitmap for a dirty
range. If the range isn't dirty we do
bit_start++;
to move on to the next bit. The problem is that the bitmap is based on the
number of sectors that an EB has. So in this case, we have a 64k
pagesize, 16k nodesize, but a 4k sectorsize. This means our bitmap is 4
bits for every node. With a 64k page size we end up with 4 nodes per
page.
To make this easier, this is how everything looks:
[0        16k       32k       48k      ] logical address
[0        4         8         12       ] radix tree offset
[               64k page               ] folio
[ 16k eb ][ 16k eb ][ 16k eb ][ 16k eb ] extent buffers
[ | | | | | | | | | | | | | | | |      ] bitmap
Now all of our addressing is based on fs_info->sectorsize_bits, so
as you can see above, our 16k eb->start turns into radix entry 4.
When we find a dirty range for our eb, we correctly do bit_start +=
sectors_per_node, because if we start at bit 0, the next bit for the
next eb is 4, to correspond to eb->start 16k.
However, if our range is clean, we will do bit_start++, which will now
put us out of alignment with our radix tree entries.
In our case, assume that the first time we check the bitmap the block is
not dirty, we increment bit_start so now it == 1, and then we loop
around and check again. This time it is dirty, and we go to find that
start using the following equation
start = folio_start + bit_start * fs_info->sectorsize;
so in the case above, eb->start 0 is now dirty, and we calculate start
as
0 + 1 * fs_info->sectorsize = 4096
4096 >> 12 = 1
Now we're looking up the radix tree for 1, and we won't find an eb.
What's worse is now we're using bit_start == 1, so we do bit_start +=
sectors_per_node, which is now 5. If that eb is dirty we will run into
the same thing, we will look at an offset that is not populated in the
radix tree, and now we're skipping the writeout of dirty extent buffers.
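To make the offset math concrete, here is a small standalone sketch (plain userspace C, not btrfs code; the constants mirror the setup above) showing that only bit_start values that are multiples of sectors_per_node map back to an eb boundary in the radix tree:

#include <stdio.h>

int main(void)
{
	const unsigned long folio_start = 0;	/* folio at logical 0, as above */
	const unsigned long sectorsize = 4096;	/* 4k sectors */
	const int sectorsize_bits = 12;
	const int sectors_per_node = 4;		/* 16k nodesize / 4k sectorsize */

	for (int bit_start = 0; bit_start < 6; bit_start++) {
		unsigned long start = folio_start + bit_start * sectorsize;
		unsigned long index = start >> sectorsize_bits;

		printf("bit_start=%d -> start=%5lu -> radix index %lu (%s)\n",
		       bit_start, start, index,
		       bit_start % sectors_per_node ? "no eb here" : "eb boundary");
	}
	return 0;
}

With the buggy bit_start++ the loop can end up probing indexes 1, 2, 3, 5, ... where no extent buffer is stored, so dirty ebs get skipped; the fixed increment only ever probes 0, 4, 8, 12.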
The best fix for this is to not use sectorsize_bits to address nodes,
but that's a larger change. Since this is a filesystem corruption problem,
fix it simply by always using sectors_per_node to increment the start bit.
cc: stable(a)vger.kernel.org
Fixes: c4aec299fa8f ("btrfs: introduce submit_eb_subpage() to submit a subpage metadata page")
Reviewed-by: Boris Burkov <boris(a)bur.io>
Signed-off-by: Josef Bacik <josef(a)toxicpanda.com>
---
- Further testing indicated that the theoretical page tagging race isn't getting
hit in practice, so we're going to limit the "hotfix" to this specific patch,
and then send subsequent patches to address the other issues we're hitting. My
"simplify metadata writeback" patches are the more holistic fix.
fs/btrfs/extent_io.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 5f08615b334f..6cfd286b8bbc 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2034,7 +2034,7 @@ static int submit_eb_subpage(struct folio *folio, struct writeback_control *wbc)
 				  subpage->bitmaps)) {
 			spin_unlock_irqrestore(&subpage->lock, flags);
 			spin_unlock(&folio->mapping->i_private_lock);
-			bit_start++;
+			bit_start += sectors_per_node;
 			continue;
 		}
--
2.48.1