Ever since the Dirty COW (CVE-2016-5195) security issue happened, we know that FOLL_FORCE can be possibly dangerous, especially if there are races that can be exploited by user space.
Right now, it would be sufficient to have some code that sets a PTE of a R/O-mapped shared page dirty, in order for it to erroneously become writable by FOLL_FORCE. The implications of setting a write-protected PTE dirty might not be immediately obvious to everyone.
And in fact ever since commit 9ae0f87d009c ("mm/shmem: unconditionally set pte dirty in mfill_atomic_install_pte"), we can use UFFDIO_CONTINUE to map a shmem page R/O while marking the pte dirty. This can be used by unprivileged user space to modify tmpfs/shmem file content even if the user does not have write permissions to the file -- Dirty COW restricted to tmpfs/shmem (CVE-2022-2590).
To fix such security issues for good, the insight is that we really only need that fancy retry logic (FOLL_COW) for COW mappings that are not writable (!VM_WRITE). And in a COW mapping, we really only broke COW if we have an exclusive anonymous page mapped. If we have something else mapped, or the mapped anonymous page might be shared (!PageAnonExclusive), we have to trigger a write fault to break COW. If we don't find an exclusive anonymous page when we retry, we have to trigger COW breaking once again because something intervened.
Let's move away from this mandatory-retry + dirty handling and rely on our PageAnonExclusive() flag for making a similar decision, to use the same COW logic as in other kernel parts here as well. In case we stumble over a PTE in a COW mapping that does not map an exclusive anonymous page, COW was not properly broken and we have to trigger a fake write-fault to break COW.
Just like we do in can_change_pte_writable() added via commit 64fe24a3e05e ("mm/mprotect: try avoiding write faults for exclusive anonymous pages when changing protection") and commit 76aefad628aa ("mm/mprotect: fix soft-dirty check in can_change_pte_writable()"), take care of softdirty and uffd-wp manually.
For example, a write() via /proc/self/mem to a uffd-wp-protected range has to fail instead of silently granting write access and bypassing the userspace fault handler. Note that FOLL_FORCE is not only used for debug access, but also triggered by applications without debug intentions, for example, when pinning pages via RDMA.
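For illustration, the classic debugger-style access that goes through FOLL_FORCE is a write() to /proc/self/mem at the target virtual address. A minimal userspace sketch (error handling omitted; nothing below is taken from the patch itself):

	#include <fcntl.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		/* R/O private mapping: a direct store through *p would segfault. */
		char *p = mmap(NULL, 4096, PROT_READ,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		int fd = open("/proc/self/mem", O_RDWR);

		/*
		 * The file offset of /proc/self/mem is the virtual address.
		 * This write is served via GUP with FOLL_FORCE, so the kernel
		 * breaks COW for us instead of failing the access.
		 */
		ssize_t ret = pwrite(fd, "x", 1, (off_t)(uintptr_t)p);

		printf("pwrite returned %zd, *p is now '%c'\n", ret, *p);
		close(fd);
		return 0;
	}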
This fixes CVE-2022-2590. Note that only x86_64 and aarch64 are affected, because only those support CONFIG_HAVE_ARCH_USERFAULTFD_MINOR.
Fortunately, FOLL_COW is no longer required to handle FOLL_FORCE. So let's just get rid of it.
Note 1: We don't check for the PTE being dirty because it doesn't matter for making a "was COWed" decision anymore, and whoever modifies the page has to set the page dirty either way.
Note 2: Kernels before extended uffd-wp support and before PageAnonExclusive (< 5.19) can simply revert the problematic commit instead and be safe regarding UFFDIO_CONTINUE. A backport to v5.19 requires minor adjustments due to lack of vma_soft_dirty_enabled().
Fixes: 9ae0f87d009c ("mm/shmem: unconditionally set pte dirty in mfill_atomic_install_pte")
Cc: stable@vger.kernel.org # 5.16+
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
Against upstream from yesterday instead of v5.19 because I wanted to reference the mprotect commit IDs and can_change_pte_writable(), and I wanted to directly use vma_soft_dirty_enabled().
I have a working reproducer that I'll post to oss-security in one week. Of course, that reproducer no longer triggers with that commit and my ptrace testing indicated that FOLL_FORCE seems to continue working as expected.
---
 include/linux/mm.h |  1 -
 mm/gup.c           | 62 +++++++++++++++++++++++++++++-----------------
 mm/huge_memory.c   | 45 +++++++++++++++++----------------
 3 files changed, 63 insertions(+), 45 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 18e01474cf6b..2222ed598112 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2885,7 +2885,6 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 #define FOLL_MIGRATION	0x400	/* wait for page to replace migration entry */
 #define FOLL_TRIED	0x800	/* a retry, previous pass started an IO */
 #define FOLL_REMOTE	0x2000	/* we are working on non-current tsk/mm */
-#define FOLL_COW	0x4000	/* internal GUP flag */
 #define FOLL_ANON	0x8000	/* don't do file mappings */
 #define FOLL_LONGTERM	0x10000	/* mapping lifetime is indefinite: see below */
 #define FOLL_SPLIT_PMD	0x20000	/* split huge pmd before returning */
diff --git a/mm/gup.c b/mm/gup.c
index 732825157430..7a0b207f566f 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -478,14 +478,34 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
 	return -EEXIST;
 }
-/*
- * FOLL_FORCE can write to even unwritable pte's, but only
- * after we've gone through a COW cycle and they are dirty.
- */
-static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
-{
-	return pte_write(pte) ||
-		((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
+/* FOLL_FORCE can write to even unwritable PTEs in COW mappings. */
+static inline bool can_follow_write_pte(pte_t pte, struct page *page,
+					struct vm_area_struct *vma,
+					unsigned int flags)
+{
+	if (pte_write(pte))
+		return true;
+	if (!(flags & FOLL_FORCE))
+		return false;
+
+	/*
+	 * See check_vma_flags(): only COW mappings need that special
+	 * "force" handling when they lack VM_WRITE.
+	 */
+	if (vma->vm_flags & VM_WRITE)
+		return false;
+	VM_BUG_ON(!is_cow_mapping(vma->vm_flags));
+
+	/*
+	 * See can_change_pte_writable(): we broke COW and could map the page
+	 * writable if we have an exclusive anonymous page and a write-fault
+	 * isn't required for other reasons.
+	 */
+	if (!page || !PageAnon(page) || !PageAnonExclusive(page))
+		return false;
+	if (vma_soft_dirty_enabled(vma) && !pte_soft_dirty(pte))
+		return false;
+	return !userfaultfd_pte_wp(vma, pte);
 }
 static struct page *follow_page_pte(struct vm_area_struct *vma,
@@ -528,12 +548,19 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 	}
 	if ((flags & FOLL_NUMA) && pte_protnone(pte))
 		goto no_page;
-	if ((flags & FOLL_WRITE) && !can_follow_write_pte(pte, flags)) {
-		pte_unmap_unlock(ptep, ptl);
-		return NULL;
-	}
 	page = vm_normal_page(vma, address, pte);
+
+	/*
+	 * We only care about anon pages in can_follow_write_pte() and don't
+	 * have to worry about pte_devmap() because they are never anon.
+	 */
+	if ((flags & FOLL_WRITE) &&
+	    !can_follow_write_pte(pte, page, vma, flags)) {
+		page = NULL;
+		goto out;
+	}
+
 	if (!page && pte_devmap(pte) && (flags & (FOLL_GET | FOLL_PIN))) {
 		/*
 		 * Only return device mapping pages in the FOLL_GET or FOLL_PIN
@@ -986,17 +1013,6 @@ static int faultin_page(struct vm_area_struct *vma,
 		return -EBUSY;
 	}
-	/*
-	 * The VM_FAULT_WRITE bit tells us that do_wp_page has broken COW when
-	 * necessary, even if maybe_mkwrite decided not to set pte_write. We
-	 * can thus safely do subsequent page lookups as if they were reads.
-	 * But only do so when looping for pte_write is futile: in some cases
-	 * userspace may also be wanting to write to the gotten user page,
-	 * which a read fault here might prevent (a readonly page might get
-	 * reCOWed by userspace write).
-	 */
-	if ((ret & VM_FAULT_WRITE) && !(vma->vm_flags & VM_WRITE))
-		*flags |= FOLL_COW;
 	return 0;
 }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8a7c1b344abe..352b5220e95e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1040,12 +1040,6 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
assert_spin_locked(pmd_lockptr(mm, pmd));
-	/*
-	 * When we COW a devmap PMD entry, we split it into PTEs, so we should
-	 * not be in this function with `flags & FOLL_COW` set.
-	 */
-	WARN_ONCE(flags & FOLL_COW, "mm: In follow_devmap_pmd with FOLL_COW set");
-
 	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
 	if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
 			 (FOLL_PIN | FOLL_GET)))
@@ -1395,14 +1389,23 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
 	return VM_FAULT_FALLBACK;
 }
-/*
- * FOLL_FORCE can write to even unwritable pmd's, but only
- * after we've gone through a COW cycle and they are dirty.
- */
-static inline bool can_follow_write_pmd(pmd_t pmd, unsigned int flags)
+/* See can_follow_write_pte() on FOLL_FORCE details. */
+static inline bool can_follow_write_pmd(pmd_t pmd, struct page *page,
+					struct vm_area_struct *vma,
+					unsigned int flags)
 {
-	return pmd_write(pmd) ||
-		((flags & FOLL_FORCE) && (flags & FOLL_COW) && pmd_dirty(pmd));
+	if (pmd_write(pmd))
+		return true;
+	if (!(flags & FOLL_FORCE))
+		return false;
+	if (vma->vm_flags & VM_WRITE)
+		return false;
+	VM_BUG_ON(!is_cow_mapping(vma->vm_flags));
+	if (!page || !PageAnon(page) || !PageAnonExclusive(page))
+		return false;
+	if (vma_soft_dirty_enabled(vma) && !pmd_soft_dirty(pmd))
+		return false;
+	return !userfaultfd_huge_pmd_wp(vma, pmd);
 }
 struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
@@ -1411,12 +1414,16 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 				   unsigned int flags)
 {
 	struct mm_struct *mm = vma->vm_mm;
-	struct page *page = NULL;
+	struct page *page;
assert_spin_locked(pmd_lockptr(mm, pmd));
-	if (flags & FOLL_WRITE && !can_follow_write_pmd(*pmd, flags))
-		goto out;
+	page = pmd_page(*pmd);
+	VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
+
+	if ((flags & FOLL_WRITE) &&
+	    !can_follow_write_pmd(*pmd, page, vma, flags))
+		return NULL;
 	/* Avoid dumping huge zero page */
 	if ((flags & FOLL_DUMP) && is_huge_zero_pmd(*pmd))
@@ -1424,10 +1431,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 	/* Full NUMA hinting faults to serialise migration in fault paths */
 	if ((flags & FOLL_NUMA) && pmd_protnone(*pmd))
-		goto out;
-
-	page = pmd_page(*pmd);
-	VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
+		return NULL;
 	if (!pmd_write(*pmd) && gup_must_unshare(flags, page))
 		return ERR_PTR(-EMLINK);
@@ -1444,7 +1448,6 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 	page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
 	VM_BUG_ON_PAGE(!PageCompound(page) && !is_zone_device_page(page), page);
-out:
 	return page;
 }
base-commit: 1612c382ffbdf1f673caec76502b1c00e6d35363
On 08.08.22 09:32, David Hildenbrand wrote:
Fortunately, FOLL_COW is no longer required to handle FOLL_FORCE. So let's just get rid of it.
I have to add here:
"Thanks to Nadav Amit for pointing out that the pte_dirty() check in FOLL_FORCE code is problematic and might be exploitable."
I'm still reading through this, but
STOP DOING THIS
On Mon, Aug 8, 2022 at 12:32 AM David Hildenbrand david@redhat.com wrote:
VM_BUG_ON(!is_cow_mapping(vma->vm_flags));
Using BUG_ON() for debugging is simply not ok.
And saying "but it's just a VM_BUG_ON()" does not change *anything*. At least Fedora enables that unconditionally for normal people, it is not some kind of "only VM people do this".
Really. BUG_ON() IS NOT FOR DEBUGGING.
Stop it. Now.
If you have a condition that must not happen, you either write that condition into the code, or - if you are convinced it cannot happen - you make it a WARN_ON_ONCE() so that people can report it to you.
The BUG_ON() will just make the machine die.
And for the facebooks and googles of the world, the WARN_ON() will be sufficient.
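In code, the two acceptable patterns look roughly like this (a sketch with a made-up condition, not lifted from any real patch):

	/* Option 1: treat the "impossible" case as a normal error path. */
	if (!page)
		return NULL;

	/* Option 2: report it once so people can send it to you, then bail out. */
	if (WARN_ON_ONCE(!page))
		return NULL;

as opposed to BUG_ON(!page), which just takes the whole machine down.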
Linus
On 09.08.22 20:27, Linus Torvalds wrote:
I'm still reading through this, but
STOP DOING THIS
On Mon, Aug 8, 2022 at 12:32 AM David Hildenbrand david@redhat.com wrote:
VM_BUG_ON(!is_cow_mapping(vma->vm_flags));
Using BUG_ON() for debugging is simply not ok.
And saying "but it's just a VM_BUG_ON()" does not change *anything*. At least Fedora enables that unconditionally for normal people, it is not some kind of "only VM people do this".
Really. BUG_ON() IS NOT FOR DEBUGGING.
I totally agree with BUG_ON ... but if I get talked to in all-caps on a Thursday evening and feel like I just touched the forbidden fruit, I have to ask for details about VM_BUG_ON nowadays.
VM_BUG_ON is only active with CONFIG_DEBUG_VM. ... which indicated some kind of debugging at least to me. I *know* that Fedora enables it and I *know* that this will make Fedora crash.
I know why Fedora enables this debug option, but it kind of destroys the whole purpose of VM_BUG_ON nowadays?
For this case, this condition will never trigger and I consider it much more a hint to the reader that we can rest assured that this condition holds. And on production systems, it will get optimized out.
Should we forbid any new usage of VM_BUG_ON just like we mostly do with BUG_ON?
Stop it. Now.
If you have a condition that must not happen, you either write that condition into the code, or - if you are convinced it cannot happen - you make it a WARN_ON_ONCE() so that people can report it to you.
I can just turn that into a WARN_ON_ONCE() or even a VM_WARN_ON_ONCE().
On Tue, Aug 9, 2022 at 11:45 AM David Hildenbrand david@redhat.com wrote:
I totally agree with BUG_ON ... but if I get talked to in all-caps on a Thursday evening and feel like I just touched the forbidden fruit, I have to ask for details about VM_BUG_ON nowadays.
VM_BUG_ON is only active with CONFIG_DEBUG_VM. ... which indicated some kind of debugging at least to me. I *know* that Fedora enables it and I *know* that this will make Fedora crash.
No.
VM_BUG_ON() has the exact same semantics as BUG_ON. It is literally no different, the only difference is "we can make the code smaller because these are less important".
The only possible case where BUG_ON can validly be used is "I have some fundamental data corruption and cannot possibly return an error".
This kind of "I don't think this can happen" is _never_ an excuse for it.
Honestly, 99% of our existing BUG_ON() ones are completely bogus, and left-over debug code that wasn't removed because they never triggered. I've several times considered just using a coccinelle script to remove every single BUG_ON() (and VM_BUG_ON()) as simply bogus. Because they are pure noise.
I just tried to find a valid BUG_ON() that would make me go "yeah, that's actually worth it", and couldn't really find one. Yeah, there are several ones in the scheduler that make me go "ok, if that triggers, the machine is dead anyway", so in that sense there are certainly BUG_ON()s that don't _hurt_.
But as a very good approximation, the rule is "absolutely no new BUG_ON() calls _ever_". Because I really cannot see a single case where "proper error handling and WARN_ON_ONCE()" isn't the right thing.
Now, that said, there is one very valid sub-form of BUG_ON(): BUILD_BUG_ON() is absolutely 100% fine.
Linus
On Tue, Aug 09, 2022 at 11:59:45AM -0700, Linus Torvalds wrote:
But as a very good approximation, the rule is "absolutely no new BUG_ON() calls _ever_". Because I really cannot see a single case where "proper error handling and WARN_ON_ONCE()" isn't the right thing.
Parallel to this discussion I've had ones where people more or less say
Since BUG_ON crashes the machine and Linus says that crashing the machine is bad, WARN_ON will also crash the machine if you set the panic_on_warn parameter, so it is also bad, thus we shouldn't use anything.
I've generally maintained that people who set the panic_on_warn *want* these crashes, because that is the entire point of it. So we should use WARN_ON with an error recovery for "can't happen" assertions like these. I think it is what you are saying here.
Jason
On Tue, Aug 9, 2022 at 12:09 PM Jason Gunthorpe jgg@nvidia.com wrote:
Since BUG_ON crashes the machine and Linus says that crashing the machine is bad, WARN_ON will also crash the machine if you set the panic_on_warn parameter, so it is also bad, thus we shouldn't use anything.
If you set 'panic_on_warn' you get to keep both pieces when something breaks.
The thing is, there are people who *do* want to stop immediately when something goes wrong in the kernel.
Anybody doing large-scale virtualization presumably has all the infrastructure to get debug info out of the virtual environment.
And people who run controlled loads in big server machine setups and have a MIS department to manage said machines typically also prefer for a machine to just crash over continuing.
So in those situations, a dead machine is still a dead machine, but you get the information out, and panic_on_warn is fine, because panic and reboot is fine.
And yes, that's actually a fairly common case. Things like syzkaller etc *wants* to abort on the first warning, because that's kind of the point.
But while that kind of virtualized automation machinery is very very common, and is a big deal, it's by no means the only deal, and the most important thing to the point where nothing else matters.
And if you are *not* in a farm, and if you are *not* using virtualization, a dead machine is literally a useless brick. Nobody has serial lines on individual machines any more. In most cases, the hardware literally doesn't even exist any more.
So in that situation, you really cannot afford to take the approach of "just kill the machine". If you are on a laptop and are doing power management code, you generally cannot do that in a virtual environment, and you already have enough problems with suspend and resume being hard to debug, without people also going "oh, let's just BUG_ON() and kill the machine".
Because the other side of that "we have a lot of machine farms doing automated testing" is that those machine farms do not generally find a lot of the exciting cases.
Almost every single merge window, I end up having to bisect and report an oops or a WARN_ON(), because I actually run on real hardware. And said problem was never seen in linux-next.
So we have two very different cases: the "virtual machine with good logging where a dead machine is fine" - use 'panic_on_warn'. And the actual real hardware with real drivers, running real loads by users.
Both are valid. But the second case means that BUG_ON() is basically _never_ valid.
Linus
From: Jason Gunthorpe
Sent: 09 August 2022 20:08
On Tue, Aug 09, 2022 at 11:59:45AM -0700, Linus Torvalds wrote:
But as a very good approximation, the rule is "absolutely no new BUG_ON() calls _ever_". Because I really cannot see a single case where "proper error handling and WARN_ON_ONCE()" isn't the right thing.
Parallel to this discussion I've had ones where people more or less say
Since BUG_ON crashes the machine and Linus says that crashing the machine is bad, WARN_ON will also crash the machine if you set the panic_on_warn parameter, so it is also bad, thus we shouldn't use anything.
I've generally maintained that people who set the panic_on_warn *want* these crashes, because that is the entire point of it. So we should use WARN_ON with an error recovery for "can't happen" assertions like these. I think it is what you are saying here.
They don't necessarily want the crashes; it is more that the people who built the distribution think they want the crashes.
I have had issues with a customer system (with our drivers) randomly locking up. Someone had decided that 'PANIC_ON_OOPS' was a good idea but hadn't enabled anything to actually take the dump.
So instead of a diagnosable problem (and a 'doh' moment) you get several weeks of head scratching and a very annoyed user.
David
* Linus Torvalds torvalds@linux-foundation.org wrote:
I just tried to find a valid BUG_ON() that would make me go "yeah, that's actually worth it", and couldn't really find one. Yeah, there are several ones in the scheduler that make me go "ok, if that triggers, the machine is dead anyway", so in that sense there are certainly BUG_ON()s that don't _hurt_.
That's a mostly accidental, historical accumulation of BUG_ON()s - I believe we can change all of them to WARN_ON() via the patch below.
As far as the scheduler is concerned, we don't need any BUG_ON()s.
[ This assumes that printk() itself is atomic and non-recursive wrt. the scheduler in these code paths ... ]
Thanks,
Ingo
===============>
From: Ingo Molnar <mingo@kernel.org>
Date: Thu, 11 Aug 2022 08:54:52 +0200
Subject: [PATCH] sched/all: Change BUG_ON() instances to WARN_ON()
There's no good reason to crash a user's system with a BUG_ON(), chances are high that they'll never even see the crash message on Xorg, and it won't make it into the syslog either.
By using a WARN_ON() we at least give the user a chance to report any bugs triggered here - instead of getting silent hangs.
None of these WARN_ON()s are supposed to trigger, ever - so we ignore cases where a NULL check is done via a BUG_ON() and we let a NULL pointer through after a WARN_ON().
There's one exception: WARN_ON() arguments with side-effects, such as locking - in this case we use the return value of the WARN_ON(), such as in:
-	BUG_ON(!lock_task_sighand(p, &flags));
+	if (WARN_ON(!lock_task_sighand(p, &flags)))
+		return;
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/autogroup.c |  3 ++-
 kernel/sched/core.c      |  2 +-
 kernel/sched/cpupri.c    |  2 +-
 kernel/sched/deadline.c  | 26 +++++++++++++-------------
 kernel/sched/fair.c      | 10 +++++-----
 kernel/sched/rt.c        |  2 +-
 kernel/sched/sched.h     |  6 +++---
 7 files changed, 26 insertions(+), 25 deletions(-)
diff --git a/kernel/sched/autogroup.c b/kernel/sched/autogroup.c index 4ebaf97f7bd8..13f6b6da35a0 100644 --- a/kernel/sched/autogroup.c +++ b/kernel/sched/autogroup.c @@ -161,7 +161,8 @@ autogroup_move_group(struct task_struct *p, struct autogroup *ag) struct task_struct *t; unsigned long flags;
- BUG_ON(!lock_task_sighand(p, &flags)); + if (WARN_ON(!lock_task_sighand(p, &flags))) + return;
prev = p->signal->autogroup; if (prev == ag) { diff --git a/kernel/sched/core.c b/kernel/sched/core.c index d3d61cbb6b3c..f84206bf42cd 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2328,7 +2328,7 @@ static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf, rq = cpu_rq(new_cpu);
rq_lock(rq, rf); - BUG_ON(task_cpu(p) != new_cpu); + WARN_ON(task_cpu(p) != new_cpu); activate_task(rq, p, 0); check_preempt_curr(rq, p, 0);
diff --git a/kernel/sched/cpupri.c b/kernel/sched/cpupri.c index fa9ce9d83683..9f719e4ea081 100644 --- a/kernel/sched/cpupri.c +++ b/kernel/sched/cpupri.c @@ -147,7 +147,7 @@ int cpupri_find_fitness(struct cpupri *cp, struct task_struct *p, int task_pri = convert_prio(p->prio); int idx, cpu;
- BUG_ON(task_pri >= CPUPRI_NR_PRIORITIES); + WARN_ON(task_pri >= CPUPRI_NR_PRIORITIES);
for (idx = 0; idx < task_pri; idx++) {
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index 1d9c90958baa..fb234077c317 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -310,7 +310,7 @@ static void dl_change_utilization(struct task_struct *p, u64 new_bw) { struct rq *rq;
- BUG_ON(p->dl.flags & SCHED_FLAG_SUGOV); + WARN_ON(p->dl.flags & SCHED_FLAG_SUGOV);
if (task_on_rq_queued(p)) return; @@ -607,7 +607,7 @@ static void enqueue_pushable_dl_task(struct rq *rq, struct task_struct *p) { struct rb_node *leftmost;
- BUG_ON(!RB_EMPTY_NODE(&p->pushable_dl_tasks)); + WARN_ON(!RB_EMPTY_NODE(&p->pushable_dl_tasks));
leftmost = rb_add_cached(&p->pushable_dl_tasks, &rq->dl.pushable_dl_tasks_root, @@ -684,7 +684,7 @@ static struct rq *dl_task_offline_migration(struct rq *rq, struct task_struct *p * Failed to find any suitable CPU. * The task will never come back! */ - BUG_ON(dl_bandwidth_enabled()); + WARN_ON(dl_bandwidth_enabled());
/* * If admission control is disabled we @@ -830,7 +830,7 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se) struct dl_rq *dl_rq = dl_rq_of_se(dl_se); struct rq *rq = rq_of_dl_rq(dl_rq);
- BUG_ON(pi_of(dl_se)->dl_runtime <= 0); + WARN_ON(pi_of(dl_se)->dl_runtime <= 0);
/* * This could be the case for a !-dl task that is boosted. @@ -1616,7 +1616,7 @@ static void __enqueue_dl_entity(struct sched_dl_entity *dl_se) { struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
- BUG_ON(!RB_EMPTY_NODE(&dl_se->rb_node)); + WARN_ON(!RB_EMPTY_NODE(&dl_se->rb_node));
rb_add_cached(&dl_se->rb_node, &dl_rq->root, __dl_less);
@@ -1640,7 +1640,7 @@ static void __dequeue_dl_entity(struct sched_dl_entity *dl_se) static void enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags) { - BUG_ON(on_dl_rq(dl_se)); + WARN_ON(on_dl_rq(dl_se));
update_stats_enqueue_dl(dl_rq_of_se(dl_se), dl_se, flags);
@@ -2017,7 +2017,7 @@ static struct task_struct *pick_task_dl(struct rq *rq) return NULL;
dl_se = pick_next_dl_entity(dl_rq); - BUG_ON(!dl_se); + WARN_ON(!dl_se); p = dl_task_of(dl_se);
return p; @@ -2277,12 +2277,12 @@ static struct task_struct *pick_next_pushable_dl_task(struct rq *rq)
p = __node_2_pdl(rb_first_cached(&rq->dl.pushable_dl_tasks_root));
- BUG_ON(rq->cpu != task_cpu(p)); - BUG_ON(task_current(rq, p)); - BUG_ON(p->nr_cpus_allowed <= 1); + WARN_ON(rq->cpu != task_cpu(p)); + WARN_ON(task_current(rq, p)); + WARN_ON(p->nr_cpus_allowed <= 1);
- BUG_ON(!task_on_rq_queued(p)); - BUG_ON(!dl_task(p)); + WARN_ON(!task_on_rq_queued(p)); + WARN_ON(!dl_task(p));
return p; } @@ -2492,7 +2492,7 @@ static void set_cpus_allowed_dl(struct task_struct *p, struct root_domain *src_rd; struct rq *rq;
- BUG_ON(!dl_task(p)); + WARN_ON(!dl_task(p));
rq = task_rq(p); src_rd = rq->rd; diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index da388657d5ac..00c01b3232b9 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -2600,7 +2600,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags, if (!join) return;
- BUG_ON(irqs_disabled()); + WARN_ON(irqs_disabled()); double_lock_irq(&my_grp->lock, &grp->lock);
for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++) { @@ -7279,7 +7279,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_ return;
find_matching_se(&se, &pse); - BUG_ON(!pse); + WARN_ON(!pse);
cse_is_idle = se_is_idle(se); pse_is_idle = se_is_idle(pse); @@ -8159,7 +8159,7 @@ static void attach_task(struct rq *rq, struct task_struct *p) { lockdep_assert_rq_held(rq);
- BUG_ON(task_rq(p) != rq); + WARN_ON(task_rq(p) != rq); activate_task(rq, p, ENQUEUE_NOCLOCK); check_preempt_curr(rq, p, 0); } @@ -10134,7 +10134,7 @@ static int load_balance(int this_cpu, struct rq *this_rq, goto out_balanced; }
- BUG_ON(busiest == env.dst_rq); + WARN_ON(busiest == env.dst_rq);
schedstat_add(sd->lb_imbalance[idle], env.imbalance);
@@ -10430,7 +10430,7 @@ static int active_load_balance_cpu_stop(void *data) * we need to fix it. Originally reported by * Bjorn Helgaas on a 128-CPU setup. */ - BUG_ON(busiest_rq == target_rq); + WARN_ON(busiest_rq == target_rq);
/* Search for an sd spanning us and the target CPU. */ rcu_read_lock(); diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c index 054b6711e961..acf9f5ce0c4a 100644 --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c @@ -843,7 +843,7 @@ static void __disable_runtime(struct rq *rq) * We cannot be left wanting - that would mean some runtime * leaked out of the system. */ - BUG_ON(want); + WARN_ON(want); balanced: /* * Disable all the borrow logic by pretending we have inf diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index b0bf2287dd9d..8e5df3bc3483 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2699,8 +2699,8 @@ static inline void double_rq_lock(struct rq *rq1, struct rq *rq2) __acquires(rq1->lock) __acquires(rq2->lock) { - BUG_ON(!irqs_disabled()); - BUG_ON(rq1 != rq2); + WARN_ON(!irqs_disabled()); + WARN_ON(rq1 != rq2); raw_spin_rq_lock(rq1); __acquire(rq2->lock); /* Fake it out ;) */ double_rq_clock_clear_update(rq1, rq2); @@ -2716,7 +2716,7 @@ static inline void double_rq_unlock(struct rq *rq1, struct rq *rq2) __releases(rq1->lock) __releases(rq2->lock) { - BUG_ON(rq1 != rq2); + WARN_ON(rq1 != rq2); raw_spin_rq_unlock(rq1); __release(rq2->lock); }
On Thu, Aug 11, 2022 at 12:13 AM Ingo Molnar mingo@kernel.org wrote:
By using a WARN_ON() we at least give the user a chance to report any bugs triggered here - instead of getting silent hangs.
None of these WARN_ON()s are supposed to trigger, ever - so we ignore cases where a NULL check is done via a BUG_ON() and we let a NULL pointer through after a WARN_ON().
May I suggest going one step further, and making these WARN_ON_ONCE() instead.
From personal experience, once some scheduler bug (or task struct corruption) happens, it often *keeps* happening, and the logs just fill up with more and more data, to the point where you lose sight of the original report (and the machine can even get unusable just from the logging).
WARN_ON_ONCE() can help that situation.
Now, obviously
(a) WARN_ON_ONCE *can* also result in less information, and maybe there are situations where having more - possibly different - cases of the same thing triggering could be useful.
(b) WARN_ON_ONCE historically generated a bit bigger code than WARN_ON simply due to the extra "did this already trigger" check.
I *think* (b) is no longer true, and it's just a flag these days, but I didn't actually check.
so it's not like there aren't potential downsides, but in general I think the sanest and most natural thing is to have BUG_ON() translate to WARN_ON_ONCE().
For the "reboot-on-warn" people, it ends up being the same thing. And for the rest of us, the "give me *one* warning" can end up making the reporting a lot easier.
Obviously, with the "this never actually happens", the whole "once or many times" is kind of moot. But if it never happens at all, to the point where it doesn't even add a chance of helping debugging, maybe the whole test should be removed entirely...
Linus
On Thu, Aug 11, 2022 at 01:43:09PM -0700, Linus Torvalds wrote:
May I suggest going one step further, and making these WARN_ON_ONCE() instead.
From personal experience, once some scheduler bug (or task struct
corruption) happens, it often *keeps* happening, and the logs just fill up with more and more data, to the point where you lose sight of the original report (and the machine can even get unusable just from the logging).
I've been thinking about magically turning all the WARN_ON_ONCE() into (effectively) WARN_ON_RATELIMIT(). I had some patches in that direction a while ago but never got round to tidying them up for submission.
On Thu, Aug 11, 2022 at 10:28:27PM +0100, Matthew Wilcox wrote:
On Thu, Aug 11, 2022 at 01:43:09PM -0700, Linus Torvalds wrote:
May I suggest going one step further, and making these WARN_ON_ONCE() instead.
From personal experience, once some scheduler bug (or task struct
corruption) happens, it often *keeps* happening, and the logs just fill up with more and more data, to the point where you lose sight of the original report (and the machine can even get unusable just from the logging).
I've been thinking about magically turning all the WARN_ON_ONCE() into (effectively) WARN_ON_RATELIMIT(). I had some patches in that direction a while ago but never got round to tidying them up for submission.
I often wonder if we have a justification for WARN_ON to even exist, I see a lot of pressure to make things into WARN_ON_ONCE based on the logic that spamming makes it useless..
Maybe a global limit of 10 warn ons per minute or something would be interesting?
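A rough sketch of what that could look like on top of the existing ratelimit helpers (the macro name and the 10-per-minute budget are made up for illustration):

	#include <linux/ratelimit.h>

	/* Hypothetical: allow at most 10 warnings per minute, system-wide. */
	static DEFINE_RATELIMIT_STATE(global_warn_rs, 60 * HZ, 10);

	#define WARN_ON_GLOBAL_RATELIMIT(condition)			\
	({								\
		int __ret = !!(condition);				\
									\
		if (unlikely(__ret) && __ratelimit(&global_warn_rs))	\
			WARN_ON(1);					\
		unlikely(__ret);					\
	})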
Jason
On 8/11/22 16:22, Jason Gunthorpe wrote:
On Thu, Aug 11, 2022 at 10:28:27PM +0100, Matthew Wilcox wrote:
On Thu, Aug 11, 2022 at 01:43:09PM -0700, Linus Torvalds wrote:
May I suggest going one step further, and making these WARN_ON_ONCE() instead.
From personal experience, once some scheduler bug (or task struct
corruption) happens, it often *keeps* happening, and the logs just fill up with more and more data, to the point where you lose sight of the original report (and the machine can even get unusable just from the logging).
I've been thinking about magically turning all the WARN_ON_ONCE() into (effectively) WARN_ON_RATELIMIT(). I had some patches in that direction a while ago but never got round to tidying them up for submission.
If you do that, I'd like to suggest that you avoid using magic here, but instead just rename at the call sites.
Because:
First and foremost, something named WARN_ON_ONCE() clearly has a solemn responsibility to warn exactly "once times"! :)
Second, it's not yet clear (or is it?) that WARN_ON_ONCE() is always worse than rate limiting. It's a trade-off, rather than a clear win for either case, in my experience. The _ONCE variant can get overwritten if the kernel log wraps, but the _RATELIMIT on the other hand, may be excessive.
And finally, if it *is* agreed on here that WARN_ON_RATELIMIT() is always better than WARN_ON_ONCE(), then there is still no harm in spending a patch or two (coccinelle...) to rename WARN_ON_ONCE() --> WARN_ON_RATELIMIT(), so that we end up with accurate names.
I often wonder if we have a justification for WARN_ON to even exist, I see a lot of pressure to make things into WARN_ON_ONCE based on the logic that spamming makes it useless..
Agreed. WARN_ON_ONCE() or WARN_ON_RATELIMIT(), take your pick. But not WARN_ON_EVERY_TIME()--that usually causes serious problems in the logs.
thanks,
* Linus Torvalds torvalds@linux-foundation.org wrote:
On Thu, Aug 11, 2022 at 12:13 AM Ingo Molnar mingo@kernel.org wrote:
By using a WARN_ON() we at least give the user a chance to report any bugs triggered here - instead of getting silent hangs.
None of these WARN_ON()s are supposed to trigger, ever - so we ignore cases where a NULL check is done via a BUG_ON() and we let a NULL pointer through after a WARN_ON().
May I suggest going one step further, and making these WARN_ON_ONCE() instead.
Ok, agreed.
From personal experience, once some scheduler bug (or task struct corruption) happens, ti often *keeps* happening, and the logs just fill up with more and more data, to the point where you lose sight of the original report (and the machine can even get unusable just from the logging).
WARN_ON_ONCE() can help that situation.
Yeah, true.
Now, obviously
(a) WARN_ON_ONCE *can* also result in less information, and maybe there are situations where having more - possibly different - cases of the same thing triggering could be useful.
None of these warnings are supposed to happen absent serious random data corruption, ever™, so if against expectations they do happen once, somewhere, and its brevity makes it hard to figure out, we can still reconsider on a case by case basis & increase verbosity.
(b) WARN_ON_ONCE historically generated a bit bigger code than WARN_ON simply due to the extra "did this already trigger" check.
I *think* (b) is no longer true, and it's just a flag these days, but I didn't actually check.
Yeah, so on architectures that implement a smart __WARN_FLAGS() primitive, such as x86, overhead is pretty good:
#define __WARN_FLAGS(flags)					\
do {								\
	__auto_type __flags = BUGFLAG_WARNING|(flags);		\
	instrumentation_begin();				\
	_BUG_FLAGS(ASM_UD2, __flags, ASM_REACHABLE);		\
	instrumentation_end();					\
} while (0)
For them the runtime overhead of a WARN_ON_ONCE() is basically just the check itself:
# no checks:
ffffffff810a3ae0 <check_preempt_curr>:
ffffffff810a3ae0:  48 8b 87 20 09 00 00    mov    0x920(%rdi),%rax
ffffffff810a3ae7:  48 8b 8e 90 02 00 00    mov    0x290(%rsi),%rcx
ffffffff810a3aee:  53                      push   %rbx
ffffffff810a3aef:  48 89 fb                mov    %rdi,%rbx
ffffffff810a3af2:  48 3b 88 90 02 00 00    cmp    0x290(%rax),%rcx
# Single-branch WARN_ON_ONCE():
#
# void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
# {
# +	WARN_ON_ONCE(!rq);
# +
# 	if (p->sched_class == rq->curr->sched_class)
# 		rq->curr->sched_class->check_preempt_curr(rq, p, flags);
# 	else if (sched_class_above(p->sched_class, rq->curr->sched_class))
ffffffff810a3ae0 <check_preempt_curr>:
ffffffff810a3ae0:    53                      push   %rbx
ffffffff810a3ae1:    48 89 fb                mov    %rdi,%rbx
ffffffff810a3ae4: |  48 85 ff                test   %rdi,%rdi
ffffffff810a3ae7: |  74 69                   je     ffffffff810a3b52 <check_preempt_curr+0x72>
ffffffff810a3ae9:    48 8b 83 20 09 00 00    mov    0x920(%rbx),%rax
ffffffff810a3af0:    48 8b 8e 90 02 00 00    mov    0x290(%rsi),%rcx
ffffffff810a3af7:    48 3b 88 90 02 00 00    cmp    0x290(%rax),%rcx
 ...
ffffffff810a3b50:    eb d1                   jmp    ffffffff810a3b23 <check_preempt_curr+0x43>   # tail-call
ffffffff810a3b52: |  0f 0b                   ud2
So it's a test instruction and an unlikely-forward-branch, plus a UD2 trap squeezed into the function epilogue, often hidden by alignment holes.
As lightweight as it gets, and no in-line penalty from being a 'once' warning.
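For comparison, on architectures without such a __WARN_FLAGS() primitive, the 'once' logic is conceptually nothing more than a static flag wrapped around a plain WARN_ON(), roughly like this sketch (not the literal asm-generic code):

	#define WARN_ON_ONCE(condition) ({				\
		static bool __warned;					\
		int __ret_warn_once = !!(condition);			\
									\
		if (unlikely(__ret_warn_once && !__warned)) {		\
			__warned = true;				\
			WARN_ON(1);					\
		}							\
		unlikely(__ret_warn_once);				\
	})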
so it's not like there aren't potential downsides, but in general I think the sanest and most natural thing is to have BUG_ON() translate to WARN_ON_ONCE().
For the "reboot-on-warn" people, it ends up being the same thing. And for the rest of us, the "give me *one* warning" can end up making the reporting a lot easier.
Agreed - updated v2 patch attached.
Obviously, with the "this never actually happens", the whole "once or many times" is kind of moot. But if it never happens at all, to the point where it doesn't even add a chance of helping debugging, maybe the whole test should be removed entirely...
Yeah.
Thanks,
Ingo
========================================
From: Ingo Molnar <mingo@kernel.org>
Date: Thu, 11 Aug 2022 08:54:52 +0200
Subject: [PATCH] sched/all: Change all BUG_ON() instances in the scheduler to WARN_ON_ONCE()
There's no good reason to crash a user's system with a BUG_ON(), chances are high that they'll never even see the crash message on Xorg, and it won't make it into the syslog either.
By using a WARN_ON_ONCE() we at least give the user a chance to report any bugs triggered here - instead of getting silent hangs.
None of these WARN_ON_ONCE()s are supposed to trigger, ever - so we ignore cases where a NULL check is done via a BUG_ON() and we let a NULL pointer through after a WARN_ON_ONCE().
There's one exception: WARN_ON_ONCE() arguments with side-effects, such as locking - in this case we use the return value of the WARN_ON_ONCE(), such as in:
-	BUG_ON(!lock_task_sighand(p, &flags));
+	if (WARN_ON_ONCE(!lock_task_sighand(p, &flags)))
+		return;
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/YvSsKcAXISmshtHo@gmail.com
---
 kernel/sched/autogroup.c |  3 ++-
 kernel/sched/core.c      |  2 +-
 kernel/sched/cpupri.c    |  2 +-
 kernel/sched/deadline.c  | 26 +++++++++++++-------------
 kernel/sched/fair.c      | 10 +++++-----
 kernel/sched/rt.c        |  2 +-
 kernel/sched/sched.h     |  6 +++---
 7 files changed, 26 insertions(+), 25 deletions(-)
diff --git a/kernel/sched/autogroup.c b/kernel/sched/autogroup.c index 4ebaf97f7bd8..991fc9002535 100644 --- a/kernel/sched/autogroup.c +++ b/kernel/sched/autogroup.c @@ -161,7 +161,8 @@ autogroup_move_group(struct task_struct *p, struct autogroup *ag) struct task_struct *t; unsigned long flags;
- BUG_ON(!lock_task_sighand(p, &flags)); + if (WARN_ON_ONCE(!lock_task_sighand(p, &flags))) + return;
prev = p->signal->autogroup; if (prev == ag) { diff --git a/kernel/sched/core.c b/kernel/sched/core.c index ee28253c9ac0..813687a5f5cf 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2328,7 +2328,7 @@ static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf, rq = cpu_rq(new_cpu);
rq_lock(rq, rf); - BUG_ON(task_cpu(p) != new_cpu); + WARN_ON_ONCE(task_cpu(p) != new_cpu); activate_task(rq, p, 0); check_preempt_curr(rq, p, 0);
diff --git a/kernel/sched/cpupri.c b/kernel/sched/cpupri.c index fa9ce9d83683..a286e726eb4b 100644 --- a/kernel/sched/cpupri.c +++ b/kernel/sched/cpupri.c @@ -147,7 +147,7 @@ int cpupri_find_fitness(struct cpupri *cp, struct task_struct *p, int task_pri = convert_prio(p->prio); int idx, cpu;
- BUG_ON(task_pri >= CPUPRI_NR_PRIORITIES); + WARN_ON_ONCE(task_pri >= CPUPRI_NR_PRIORITIES);
for (idx = 0; idx < task_pri; idx++) {
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index 0ab79d819a0d..962b169b05cf 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -310,7 +310,7 @@ static void dl_change_utilization(struct task_struct *p, u64 new_bw) { struct rq *rq;
- BUG_ON(p->dl.flags & SCHED_FLAG_SUGOV); + WARN_ON_ONCE(p->dl.flags & SCHED_FLAG_SUGOV);
if (task_on_rq_queued(p)) return; @@ -607,7 +607,7 @@ static void enqueue_pushable_dl_task(struct rq *rq, struct task_struct *p) { struct rb_node *leftmost;
- BUG_ON(!RB_EMPTY_NODE(&p->pushable_dl_tasks)); + WARN_ON_ONCE(!RB_EMPTY_NODE(&p->pushable_dl_tasks));
leftmost = rb_add_cached(&p->pushable_dl_tasks, &rq->dl.pushable_dl_tasks_root, @@ -684,7 +684,7 @@ static struct rq *dl_task_offline_migration(struct rq *rq, struct task_struct *p * Failed to find any suitable CPU. * The task will never come back! */ - BUG_ON(dl_bandwidth_enabled()); + WARN_ON_ONCE(dl_bandwidth_enabled());
/* * If admission control is disabled we @@ -830,7 +830,7 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se) struct dl_rq *dl_rq = dl_rq_of_se(dl_se); struct rq *rq = rq_of_dl_rq(dl_rq);
- BUG_ON(pi_of(dl_se)->dl_runtime <= 0); + WARN_ON_ONCE(pi_of(dl_se)->dl_runtime <= 0);
/* * This could be the case for a !-dl task that is boosted. @@ -1616,7 +1616,7 @@ static void __enqueue_dl_entity(struct sched_dl_entity *dl_se) { struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
- BUG_ON(!RB_EMPTY_NODE(&dl_se->rb_node)); + WARN_ON_ONCE(!RB_EMPTY_NODE(&dl_se->rb_node));
rb_add_cached(&dl_se->rb_node, &dl_rq->root, __dl_less);
@@ -1640,7 +1640,7 @@ static void __dequeue_dl_entity(struct sched_dl_entity *dl_se) static void enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags) { - BUG_ON(on_dl_rq(dl_se)); + WARN_ON_ONCE(on_dl_rq(dl_se));
update_stats_enqueue_dl(dl_rq_of_se(dl_se), dl_se, flags);
@@ -2017,7 +2017,7 @@ static struct task_struct *pick_task_dl(struct rq *rq) return NULL;
dl_se = pick_next_dl_entity(dl_rq); - BUG_ON(!dl_se); + WARN_ON_ONCE(!dl_se); p = dl_task_of(dl_se);
return p; @@ -2277,12 +2277,12 @@ static struct task_struct *pick_next_pushable_dl_task(struct rq *rq)
p = __node_2_pdl(rb_first_cached(&rq->dl.pushable_dl_tasks_root));
- BUG_ON(rq->cpu != task_cpu(p)); - BUG_ON(task_current(rq, p)); - BUG_ON(p->nr_cpus_allowed <= 1); + WARN_ON_ONCE(rq->cpu != task_cpu(p)); + WARN_ON_ONCE(task_current(rq, p)); + WARN_ON_ONCE(p->nr_cpus_allowed <= 1);
- BUG_ON(!task_on_rq_queued(p)); - BUG_ON(!dl_task(p)); + WARN_ON_ONCE(!task_on_rq_queued(p)); + WARN_ON_ONCE(!dl_task(p));
return p; } @@ -2492,7 +2492,7 @@ static void set_cpus_allowed_dl(struct task_struct *p, struct root_domain *src_rd; struct rq *rq;
- BUG_ON(!dl_task(p)); + WARN_ON_ONCE(!dl_task(p));
rq = task_rq(p); src_rd = rq->rd; diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 914096c5b1ae..28f10dccd194 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -2600,7 +2600,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags, if (!join) return;
- BUG_ON(irqs_disabled()); + WARN_ON_ONCE(irqs_disabled()); double_lock_irq(&my_grp->lock, &grp->lock);
for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++) { @@ -7279,7 +7279,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_ return;
find_matching_se(&se, &pse); - BUG_ON(!pse); + WARN_ON_ONCE(!pse);
cse_is_idle = se_is_idle(se); pse_is_idle = se_is_idle(pse); @@ -8159,7 +8159,7 @@ static void attach_task(struct rq *rq, struct task_struct *p) { lockdep_assert_rq_held(rq);
- BUG_ON(task_rq(p) != rq); + WARN_ON_ONCE(task_rq(p) != rq); activate_task(rq, p, ENQUEUE_NOCLOCK); check_preempt_curr(rq, p, 0); } @@ -10134,7 +10134,7 @@ static int load_balance(int this_cpu, struct rq *this_rq, goto out_balanced; }
- BUG_ON(busiest == env.dst_rq); + WARN_ON_ONCE(busiest == env.dst_rq);
schedstat_add(sd->lb_imbalance[idle], env.imbalance);
@@ -10430,7 +10430,7 @@ static int active_load_balance_cpu_stop(void *data) * we need to fix it. Originally reported by * Bjorn Helgaas on a 128-CPU setup. */ - BUG_ON(busiest_rq == target_rq); + WARN_ON_ONCE(busiest_rq == target_rq);
/* Search for an sd spanning us and the target CPU. */ rcu_read_lock(); diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c index 55f39c8f4203..2936fe55cef7 100644 --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c @@ -843,7 +843,7 @@ static void __disable_runtime(struct rq *rq) * We cannot be left wanting - that would mean some runtime * leaked out of the system. */ - BUG_ON(want); + WARN_ON_ONCE(want); balanced: /* * Disable all the borrow logic by pretending we have inf diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index e26688d387ae..7a44dceeb50a 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2709,8 +2709,8 @@ static inline void double_rq_lock(struct rq *rq1, struct rq *rq2) __acquires(rq1->lock) __acquires(rq2->lock) { - BUG_ON(!irqs_disabled()); - BUG_ON(rq1 != rq2); + WARN_ON_ONCE(!irqs_disabled()); + WARN_ON_ONCE(rq1 != rq2); raw_spin_rq_lock(rq1); __acquire(rq2->lock); /* Fake it out ;) */ double_rq_clock_clear_update(rq1, rq2); @@ -2726,7 +2726,7 @@ static inline void double_rq_unlock(struct rq *rq1, struct rq *rq2) __releases(rq1->lock) __releases(rq2->lock) { - BUG_ON(rq1 != rq2); + WARN_ON_ONCE(rq1 != rq2); raw_spin_rq_unlock(rq1); __release(rq2->lock); }
On Fri, Aug 12, 2022 at 11:29:18AM +0200, Ingo Molnar wrote:
From: Ingo Molnar <mingo@kernel.org>
Date: Thu, 11 Aug 2022 08:54:52 +0200
Subject: [PATCH] sched/all: Change all BUG_ON() instances in the scheduler to WARN_ON_ONCE()
There's no good reason to crash a user's system with a BUG_ON(), chances are high that they'll never even see the crash message on Xorg, and it won't make it into the syslog either.
By using a WARN_ON_ONCE() we at least give the user a chance to report any bugs triggered here - instead of getting silent hangs.
None of these WARN_ON_ONCE()s are supposed to trigger, ever - so we ignore cases where a NULL check is done via a BUG_ON() and we let a NULL pointer through after a WARN_ON_ONCE().
There's one exception: WARN_ON_ONCE() arguments with side-effects, such as locking - in this case we use the return value of the WARN_ON_ONCE(), such as in:
-	BUG_ON(!lock_task_sighand(p, &flags));
+	if (WARN_ON_ONCE(!lock_task_sighand(p, &flags)))
+		return;
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/YvSsKcAXISmshtHo@gmail.com
 kernel/sched/autogroup.c |  3 ++-
 kernel/sched/core.c      |  2 +-
 kernel/sched/cpupri.c    |  2 +-
 kernel/sched/deadline.c  | 26 +++++++++++++-------------
 kernel/sched/fair.c      | 10 +++++-----
 kernel/sched/rt.c        |  2 +-
 kernel/sched/sched.h     |  6 +++---
 7 files changed, 26 insertions(+), 25 deletions(-)
diff --git a/kernel/sched/cpupri.c b/kernel/sched/cpupri.c
index fa9ce9d83683..a286e726eb4b 100644
--- a/kernel/sched/cpupri.c
+++ b/kernel/sched/cpupri.c
@@ -147,7 +147,7 @@ int cpupri_find_fitness(struct cpupri *cp, struct task_struct *p,
 	int task_pri = convert_prio(p->prio);
 	int idx, cpu;
-	BUG_ON(task_pri >= CPUPRI_NR_PRIORITIES);
+	WARN_ON_ONCE(task_pri >= CPUPRI_NR_PRIORITIES);
for (idx = 0; idx < task_pri; idx++) {
Should the return value be used here to clamp task_pri to CPUPRI_NR_PRIORITIES? task_pri is used as an index which __cpupri_find then uses to access beyond the end of an array, and the future behaviour will be very unpredictable.
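Something along these lines, perhaps (only a sketch of the suggestion, not a tested patch):

	-	BUG_ON(task_pri >= CPUPRI_NR_PRIORITIES);
	+	if (WARN_ON_ONCE(task_pri >= CPUPRI_NR_PRIORITIES))
	+		task_pri = CPUPRI_NR_PRIORITIES - 1;

That would keep the loop below from ever walking past the end of the per-priority array while still logging the broken state.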
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 0ab79d819a0d..962b169b05cf 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2017,7 +2017,7 @@ static struct task_struct *pick_task_dl(struct rq *rq)
 		return NULL;

 	dl_se = pick_next_dl_entity(dl_rq);
-	BUG_ON(!dl_se);
+	WARN_ON_ONCE(!dl_se);
 	p = dl_task_of(dl_se);
return p;
It's a somewhat redundant check, it'll NULL pointer dereference shortly afterwards but it'll be a bit more obvious.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 914096c5b1ae..28f10dccd194 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2600,7 +2600,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
 	if (!join)
 		return;
-	BUG_ON(irqs_disabled());
+	WARN_ON_ONCE(irqs_disabled());
 	double_lock_irq(&my_grp->lock, &grp->lock);
for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++) {
Recoverable with a goto no_join. It'll be a terrible recovery because there is no way IRQs should be disabled here. Something else incredibly bad happened before this would fire.
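I.e. roughly (sketch only, reusing the no_join error path mentioned above):

	-	BUG_ON(irqs_disabled());
	+	if (WARN_ON_ONCE(irqs_disabled()))
	+		goto no_join;
	 	double_lock_irq(&my_grp->lock, &grp->lock);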
@@ -7279,7 +7279,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 		return;

 	find_matching_se(&se, &pse);
-	BUG_ON(!pse);
+	WARN_ON_ONCE(!pse);
 	cse_is_idle = se_is_idle(se);
 	pse_is_idle = se_is_idle(pse);
Similar to pick_task_dl.
@@ -10134,7 +10134,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 		goto out_balanced;
 	}
-	BUG_ON(busiest == env.dst_rq);
+	WARN_ON_ONCE(busiest == env.dst_rq);
schedstat_add(sd->lb_imbalance[idle], env.imbalance);
goto out if it triggers? It'll just continue to be unbalanced.
@@ -10430,7 +10430,7 @@ static int active_load_balance_cpu_stop(void *data)
 	 * we need to fix it. Originally reported by
 	 * Bjorn Helgaas on a 128-CPU setup.
 	 */
-	BUG_ON(busiest_rq == target_rq);
+	WARN_ON_ONCE(busiest_rq == target_rq);
 	/* Search for an sd spanning us and the target CPU. */
 	rcu_read_lock();
goto out_unlock if it fires?
For the rest, I didn't see obvious recovery paths that would allow the system to run predictably. Any of them firing will have unpredictable consequences (e.g. move_queued_task firing would be fun if it was a per-cpu kthread). Depending on which warning triggers, the remaining life of the system may be very short but maybe long enough to be logged even if system locks up shortly afterwards.
On 8/15/22 07:41, Mel Gorman wrote:
For the rest, I didn't see obvious recovery paths that would allow the system to run predictably. Any of them firing will have unpredictable consequences (e.g. move_queued_task firing would be fun if it was a per-cpu kthread). Depending on which warning triggers, the remaining life of the system may be very short but maybe long enough to be logged even if system locks up shortly afterwards.
Hi Mel,
Are you basically saying, "WARN_ON*() seems acceptable in mm/, because we can at least get the problem logged before it locks up, probably"?
Or are you implying that we should instead introduce and use some new PANIC_ON() or VM_PANIC_ON() set of macros that would allow proper "log then reboot" behavior?
thanks,
On Mon, Aug 15, 2022 at 03:12:19PM -0700, John Hubbard wrote:
On 8/15/22 07:41, Mel Gorman wrote:
For the rest, I didn't see obvious recovery paths that would allow the system to run predictably. Any of them firing will have unpredictable consequences (e.g. move_queued_task firing would be fun if it was a per-cpu kthread). Depending on which warning triggers, the remaining life of the system may be very short but maybe long enough to be logged even if system locks up shortly afterwards.
Hi Mel,
Are you basically saying, "WARN_ON*() seems acceptable in mm/, because we can at least get the problem logged before it locks up, probably"?
I don't consider this to be a subsystem-specific guideline and I am certainly not suggesting that mm/ is the gold standard but my understanding was "If a warning is recoverable in a sensible fashion then do so". A warning is always bad, but it may be possible for the system to continue long enough for the warning to be logged or an admin to schedule a reboot at a controlled time. I think Ingo is doing the right thing with this patch to avoid an unnecessary panic but for at least some of them, a partial recovery is possible so why not do it seeing as it's been changed anyway?
The outcome of a warning is very variable, it might be a coding error where state is unexpected but it can be reset to a known state, maybe some data is leaked that if it persists we eventually OOM, maybe a particular feature will no longer work as expected (e.g. load balancing doesn't balance load due to a warning) or maybe the whole system will fail completely in the near future, possibly before an error can be logged. In the scheduler side, a warning may mean that a task never runs again and another means that a task is in the wrong scheduling class which means ... no idea, but it's not good if it's a deadline task that gets manipulated by the normal scheduling class instead.
For mm/, take three examples
1. compaction.c warning. PFN ranges are unexpected, reset them. Compaction is less effective and opportunities were missed but the system will not blow up. Maybe at worst a hugetlb allocation would fail when it might otherwise have succeeded or maybe SLUB degrades because it can't use a high-order page for a cache that could reside in an order-0 page instead. It varies.
2. Warning in in filemap_unaccount_folio -- possible loss of unwritten data on normal filesystems. Bad, might cause inconsistency for state but maybe it'll be dirtied again and it'll recover eventually. It's certainly not good if that data got lost forever.
3. The warning in isolate_movable_page is bad, page flags state is either corrupted by a driver or maybe there was a bit flip error but a page may be migrated unexpectedly leading to Who Knows What, corruption of a page that gets freed and reused by another process? It's much worse than compaction restoring its internal state of where it is scanning but maybe not a complete disaster.
Or are you implying that we should instead introduce and use some new PANIC_ON() or VM_PANIC_ON() set of macros that would allow proper "log then reboot" behavior?
No, not as a general rule. VM_PANIC_ON would suggest it's under DEBUG_VM, which is not always set, and I do not see how either PANIC_ON or VM_PANIC_ON could be coded in such a way that it always gets logged, because that depends on the context in which the bad condition occurred.
I also don't think the kernel could conditionally warn or panic for a specific instance because if an admin wants a panic on a warning (can be very useful for a crash dump), then panic_on_warn is available.
Either way, PANIC_ON or VM_PANIC_ON is outside the scope of this patch.
My understanding was that a panic() should only be used in the case where the kernel is screwed and, if it tries to continue, the outcome will be much worse, or it's early in boot and the error condition means it'll be impossible to boot anyway.
* Mel Gorman mgorman@suse.de wrote:
For the rest, I didn't see obvious recovery paths that would allow the system to run predictably. Any of them firing will have unpredictable consequences (e.g. move_queued_task firing would be fun if it was a per-cpu kthread). Depending on which warning triggers, the remaining life of the system may be very short but maybe long enough to be logged even if the system locks up shortly afterwards.
Correct. I'd prefer to keep all these warnings 'simple' - i.e. no attempted recovery & control flow, unless we ever expect these to trigger.
I.e. instead of adding a 'goto' I'd prefer if we removed most of the ones you highlighted. But wanted to keep this first patch simple.
Thanks,
Ingo
On Sun, Aug 21, 2022 at 01:28:58PM +0200, Ingo Molnar wrote:
* Mel Gorman mgorman@suse.de wrote:
For the rest, I didn't see obvious recovery paths that would allow the system to run predictably. Any of them firing will have unpredictable consequences (e.g. move_queued_task firing would be fun if it was a per-cpu kthread). Depending on which warning triggers, the remaining life of the system may be very short but maybe long enough to be logged even if the system locks up shortly afterwards.
Correct. I'd prefer to keep all these warnings 'simple' - i.e. no attempted recovery & control flow, unless we ever expect these to trigger.
I generally hope we never expect a warning to trigger until of course it does :P
I.e. instead of adding a 'goto' I'd prefer if we removed most of the ones you highlighted. But wanted to keep this first patch simple.
The only one I would push back on is clamping cpupri_find_fitness() so it does not access outside the bounds of an array after a warning triggers.
For the rest, I've no strong objection to recovering, given the nature of the BUGs you are converting. If any of them *did* trigger, it would be more important to fix the issue than to recover from the warning. So, other than the out-of-bounds array access, which I'd like to see clamped just in case:
Acked-by: Mel Gorman mgorman@suse.de
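[As an aside, here is a minimal sketch of the "warn, then clamp" recovery Mel is asking for; the names (pick_cpu_priority, priority_table, NR_PRIORITIES) are hypothetical and not the actual cpupri_find_fitness() code.]

#define NR_PRIORITIES 101

static int priority_table[NR_PRIORITIES];

static int pick_cpu_priority(int idx)
{
        /*
         * WARN_ON_ONCE() logs the problem once and returns the condition's
         * value, so we can clamp the index and keep running instead of
         * reading outside the array.
         */
        if (WARN_ON_ONCE(idx < 0 || idx >= NR_PRIORITIES))
                idx = NR_PRIORITIES - 1;

        return priority_table[idx];
}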
On Mon, Aug 8, 2022 at 12:32 AM David Hildenbrand david@redhat.com wrote:
For example, a write() via /proc/self/mem to a uffd-wp-protected range has to fail instead of silently granting write access and bypassing the userspace fault handler. Note that FOLL_FORCE is not only used for debug access, but also triggered by applications without debug intentions, for example, when pinning pages via RDMA.
So this made me go "Whaa?"
I didn't even realize that the media drivers and rdma used FOLL_FORCE.
That's just completely bogus.
Why do they do that?
It seems to be completely bogus, and seems to have no actual valid reason for it. Looking through the history, it goes back to the original code submission in 2006, and doesn't have a mention of why.
I think the original reason was that the code didn't have pinning, so it used "do a write" as a pin mechanism - even for reads.
IOW, I think the non-ptrace use of FOLL_FORCE should just be removed.
Linus
On Tue, Aug 09, 2022 at 11:40:50AM -0700, Linus Torvalds wrote:
On Mon, Aug 8, 2022 at 12:32 AM David Hildenbrand david@redhat.com wrote:
For example, a write() via /proc/self/mem to a uffd-wp-protected range has to fail instead of silently granting write access and bypassing the userspace fault handler. Note that FOLL_FORCE is not only used for debug access, but also triggered by applications without debug intentions, for example, when pinning pages via RDMA.
So this made me go "Whaa?"
I didn't even realize that the media drivers and rdma used FOLL_FORCE.
That's just completely bogus.
Why do they do that?
It seems to be completely bogus, and seems to have no actual valid reason for it. Looking through the history, it goes back to the original code submission in 2006, and doesn't have a mention of why.
It is because of all this madness with COW.
Let's say an app does:
buf = mmap(MAP_PRIVATE)
rdma_pin_for_read(buf)
buf[0] = 1
Then the store to buf[0] will COW the page and the pin will become decoherent.
The purpose of the FORCE is to force COW to happen early so this is avoided.
Sadly there are real apps that end up working this way, e.g. because they are using a buffer in .data or something.
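[A hedged userspace sketch of the pattern Jason describes, using libibverbs; rdma_pin_for_read above is pseudocode, ibv_reg_mr() stands in for the pin here, and error handling is omitted.]

#include <sys/mman.h>
#include <infiniband/verbs.h>

void example(struct ibv_pd *pd)
{
        size_t len = 4096;

        /* Private anonymous mapping: any later write will COW the page. */
        char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        /* Register the buffer for read-only RDMA: the pages get pinned now. */
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len, 0);

        /*
         * Without forcing the COW break up front (what FOLL_FORCE|FOLL_WRITE
         * achieved), this store can install a new page in the page table
         * while the device keeps DMA-ing from the old, pinned one -- the pin
         * and the mapping become incoherent.
         */
        buf[0] = 1;
        (void)mr;
}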
I've hoped David's new work on this provided some kind of path to a proper solution, as the need is very similar to all the other places where we need to ensure there is no possibility of future COW.
So, these usages can be interpreted as a FOLL flag we don't have - some kind of (FOLL_EXCLUSIVE | FOLL_READ) to match the PG_anon_exclusive sort of idea.
Jason
On 09.08.22 20:48, Jason Gunthorpe wrote:
On Tue, Aug 09, 2022 at 11:40:50AM -0700, Linus Torvalds wrote:
On Mon, Aug 8, 2022 at 12:32 AM David Hildenbrand david@redhat.com wrote:
For example, a write() via /proc/self/mem to a uffd-wp-protected range has to fail instead of silently granting write access and bypassing the userspace fault handler. Note that FOLL_FORCE is not only used for debug access, but also triggered by applications without debug intentions, for example, when pinning pages via RDMA.
So this made me go "Whaa?"
I didn't even realize that the media drivers and rdma used FOLL_FORCE.
That's just completely bogus.
Why do they do that?
It seems to be completely bogus, and seems to have no actual valid reason for it. Looking through the history, it goes back to the original code submission in 2006, and doesn't have a mention of why.
It is because of all this madness with COW.
Let's say an app does:
buf = mmap(MAP_PRIVATE)
rdma_pin_for_read(buf)
buf[0] = 1
Then the store to buf[0] will COW the page and the pin will become decoherent.
The purpose of the FORCE is to force COW to happen early so this is avoided.
Sadly there are real apps that end up working this way, e.g. because they are using a buffer in .data or something.
I've hoped David's new work on this provided some kind of path to a proper solution, as the need is very similar to all the other places where we need to ensure there is no possibility of future COW.
So, these usages can be interpreted as a FOLL flag we don't have - some kind of (FOLL_EXCLUSIVE | FOLL_READ) to match the PG_anon_exclusive sort of idea.
Thanks Jason for the explanation.
I have patches in the works to no longer use FOLL_FORCE|FOLL_WRITE for taking a reliable longterm R/O pin in a MAP_PRIVATE mapping. The patches are mostly done (and comparably simple); I merely deferred sending them out because I stumbled over this issue first.
Some ugly corner cases of MAP_SHARED remain, but for most prominent use cases, my upcoming patches should allow us to just have longterm R/O pins working as expected.
On Tue, Aug 9, 2022 at 11:48 AM Jason Gunthorpe jgg@nvidia.com wrote:
It is because of all this madness with COW.
Yes, yes, but we have the proper long-term pinning now with PG_anon_exclusive, and it actually gets the pinning right not just over COW, but even over a fork - which that early write never did.
David, I thought all of that got properly merged? Is there something still missing?
Linus
On 09.08.22 21:07, Linus Torvalds wrote:
On Tue, Aug 9, 2022 at 11:48 AM Jason Gunthorpe jgg@nvidia.com wrote:
It is because of all this madness with COW.
Yes, yes, but we have the proper long-term pinning now with PG_anon_exclusive, and it actually gets the pinning right not just over COW, but even over a fork - which that early write never did.
David, I thought all of that got properly merged? Is there something still missing?
The only thing missing to get R/O longterm pins in MAP_PRIVATE mappings correct is that we have to break COW when taking a R/O longterm pin and *not* finding an anon page inside the private mapping. Regarding anon pages I am not aware of issues (due to PG_anon_exclusive).
If anybody here wants to stare at a webpage, the following commit explains the rough idea for MAP_PRIVATE:
https://github.com/davidhildenbrand/linux/commit/cd7989fb76d2513c86f01e6f7a7...
Once we have that in place, we can mostly get rid of FOLL_FORCE|FOLL_WRITE for R/O longterm pins. There are some corner cases, though, that need some additional thought, which I am still working on. FS-handled COW in MAP_SHARED mappings is just nasty (hello DAX).
(the wrong use of FOLL_GET instead of FOLL_PIN for O_DIRECT and friends still persists, but that's a different thing to handle and it's only problematic with concurrent fork() IIRC)
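[A hedged pseudo-kernel sketch of the rule David describes above for R/O longterm pins; the helper name and exact structure are illustrative, not his actual patch.]

static bool ro_longterm_pin_needs_cow_break(struct vm_area_struct *vma,
                                            struct page *page)
{
        /* Only private/COW mappings need any special handling here. */
        if (!is_cow_mapping(vma->vm_flags))
                return false;

        /* An anon page is fine: PG_anon_exclusive makes the pin reliable. */
        if (page && PageAnon(page))
                return false;

        /*
         * Nothing mapped, or a pagecache page mapped read-only: trigger a
         * write fault first so the pin ends up on an exclusive anon page.
         */
        return true;
}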
On Mon, Aug 8, 2022 at 12:32 AM David Hildenbrand david@redhat.com wrote:
For example, a write() via /proc/self/mem to a uffd-wp-protected range has to fail instead of silently granting write access and bypassing the userspace fault handler.
This, btw, just makes me go "uffd-wp is broken garbage" once more.
It also makes me go "if uffd-wp can disallow ptrace writes, then why doesn't regular write protect do it"?
IOW, I don't think the patch is wrong (apart from the VM_BUG_ON's that absolutely must go away), but I get the strong feeling that we instead should try to get rid of FOLL_FORCE entirely.
If some other user action can stop FOLL_FORCE anyway, then why do we support it at all?
Linus
On 09.08.22 20:48, Linus Torvalds wrote:
On Mon, Aug 8, 2022 at 12:32 AM David Hildenbrand david@redhat.com wrote:
For example, a write() via /proc/self/mem to a uffd-wp-protected range has to fail instead of silently granting write access and bypassing the userspace fault handler.
This, btw, just makes me go "uffd-wp is broken garbage" once more.
It also makes me go "if uffd-wp can disallow ptrace writes, then why doesn't regular write protect do it"?
I remember that it's not just uffd-wp, it's also ordinary userfaultfd if we have no page mapped, because we'd have to drop the mmap lock in order to notify user space about the fault and wait for a resolution.
IIUC, we cannot tell what exactly user-space will do as a response to a user write fault here (for example, QEMU VM snapshots have to copy page content away such that the VM snapshot remains consistent and we won't corrupt the snapshot), so we have to back off and fail the GUP. I'd say, for ptrace that's even the right thing to do because one might deadlock waiting on the user space thread that handles faults ... but that's a little off-topic to this fix here. I'm just trying to keep the semantics unchanged, as weird as they might be.
IOW, I don't think the patch is wrong (apart from the VM_BUG_ON's that absolutely must go away), but I get the strong feeling that we instead should try to get rid of FOLL_FORCE entirely.
I can resend v2 soonish, taking care of the VM_BUG_ON as you requested if there are no other comments.
If some other user action can stop FOLL_FORCE anyway, then why do we support it at all?
My humble opinion is that debugging userfaultfd-managed memory is a corner case and that we can hopefully stop using FOLL_FORCE outside of debugging context soon.
Having that said, I do enjoy having the uffd and uffd-wp feature available in user space (especially in QEMU). I don't always enjoy having to handle such corner cases in the kernel.
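[A hedged userspace sketch of the semantics being discussed: write-protect an anonymous range with uffd-wp, then poke it via /proc/self/mem (the FOLL_FORCE path). With the fix, the write is expected to fail rather than silently bypass the handler. A real program would also run a fault-handling thread; error handling is omitted.]

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
        size_t len = 4096;
        char *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        memset(addr, 0, len);                   /* populate the page first */

        int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
        struct uffdio_api api = { .api = UFFD_API,
                                  .features = UFFD_FEATURE_PAGEFAULT_FLAG_WP };
        ioctl(uffd, UFFDIO_API, &api);

        struct uffdio_register reg = {
                .range = { .start = (unsigned long)addr, .len = len },
                .mode  = UFFDIO_REGISTER_MODE_WP,
        };
        ioctl(uffd, UFFDIO_REGISTER, &reg);

        struct uffdio_writeprotect wp = {
                .range = { .start = (unsigned long)addr, .len = len },
                .mode  = UFFDIO_WRITEPROTECT_MODE_WP,
        };
        ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);

        /* FOLL_FORCE path: try to modify the protected page via /proc/self/mem. */
        int mem = open("/proc/self/mem", O_RDWR);
        char one = 1;
        ssize_t ret = pwrite(mem, &one, 1, (off_t)(unsigned long)addr);

        /* Expected with the fix: ret < 0, the uffd-wp protection is honoured. */
        return ret < 0 ? 0 : 1;
}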
On Mon, Aug 8, 2022 at 12:32 AM David Hildenbrand david@redhat.com wrote:
So I've read through the patch several times, and it seems fine, but this function (and the pmd version of it) just read oddly to me.
+static inline bool can_follow_write_pte(pte_t pte, struct page *page,
struct vm_area_struct *vma,
unsigned int flags)
+{
if (pte_write(pte))
return true;
if (!(flags & FOLL_FORCE))
return false;
/*
* See check_vma_flags(): only COW mappings need that special
* "force" handling when they lack VM_WRITE.
*/
if (vma->vm_flags & VM_WRITE)
return false;
VM_BUG_ON(!is_cow_mapping(vma->vm_flags));
So apart from the VM_BUG_ON(), this code just looks really strange - even despite the comment. Just conceptually, the whole "if it's writable, return that you cannot follow it for a write" just looks so very very strange.
That doesn't make the code _wrong_, but considering how many times this has had subtle bugs, let's not write code that looks strange.
So I would suggest that to protect against future bugs, we try to make it be fairly clear and straightforward, and maybe even a bit overly protective.
For example, let's kill the "shared mapping that you don't have write permissions to" very explicitly and without any subtle code at all. The vm_flags tests are cheap and easy, and we could very easily just add some core ones to make any mistakes much less critical.
Now, making that 'is_cow_mapping()' check explicit at the very top of this would already go a long way:
/* FOLL_FORCE for writability only affects COW mappings */
if (!is_cow_mapping(vma->vm_flags))
        return false;
but I'd actually go even further: in this case that "is_cow_mapping()" helper to some degree actually hides what is going on.
So I'd actually prefer for that function to be written something like
/* If the pte is writable, we can write to the page */
if (pte_write(pte))
        return true;

/* Maybe FOLL_FORCE is set to override it? */
if (!(flags & FOLL_FORCE))
        return false;

/* But FOLL_FORCE has no effect on shared mappings */
if (vma->vm_flags & MAP_SHARED)
        return false;

/* .. or read-only private ones */
if (!(vma->vm_flags & MAP_MAYWRITE))
        return false;

/* .. or already writable ones that just need to take a write fault */
if (vma->vm_flags & MAP_WRITE)
        return false;
and the two first vm_flags tests above are basically doing that "is_cow_mapping()", and maybe we could even have a comment to that effect, but wouldn't it be nice to just write it out that way?
And after you've written it out like the above, now that
if (!page || !PageAnon(page) || !PageAnonExclusive(page))
        return false;
makes you pretty safe from a data sharing perspective: it's most definitely not a shared page at that point.
So if you write it that way, the only remaining issues are the magic special soft-dirty and uffd ones, but at that point it's purely about the semantics of those features, no longer about any possible "oh, we fooled some shared page to be writable".
And I think the above is fairly legible without any subtle cases, and the one-liner comments make it fairly clear what it's testing.
Is any of this in any _technical_ way different from what your patch did? No. It's literally just rewriting it to be a bit more explicit in what it is doing, I think, and it makes that odd "it's not writable if VM_WRITE is set" case a bit more explicit.
Hmm?
Linus
On 09.08.22 22:00, Linus Torvalds wrote:
On Mon, Aug 8, 2022 at 12:32 AM David Hildenbrand david@redhat.com wrote:
So I've read through the patch several times, and it seems fine, but this function (and the pmd version of it) just read oddly to me.
+static inline bool can_follow_write_pte(pte_t pte, struct page *page,
struct vm_area_struct *vma,
unsigned int flags)
+{
if (pte_write(pte))
return true;
if (!(flags & FOLL_FORCE))
return false;
/*
* See check_vma_flags(): only COW mappings need that special
* "force" handling when they lack VM_WRITE.
*/
if (vma->vm_flags & VM_WRITE)
return false;
VM_BUG_ON(!is_cow_mapping(vma->vm_flags));
So apart from the VM_BUG_ON(), this code just looks really strange - even despite the comment. Just conceptually, the whole "if it's writable, return that you cannot follow it for a write" just looks so very very strange.
That doesn't make the code _wrong_, but considering how many times this has had subtle bugs, let's not write code that looks strange.
So I would suggest that to protect against future bugs, we try to make it be fairly clear and straightforward, and maybe even a bit overly protective.
For example, let's kill the "shared mapping that you don't have write permissions to" very explicitly and without any subtle code at all. The vm_flags tests are cheap and easy, and we could very easily just add some core ones to make any mistakes much less critical.
Now, making that 'is_cow_mapping()' check explicit at the very top of this would already go a long way:
/* FOLL_FORCE for writability only affects COW mappings */
if (!is_cow_mapping(vma->vm_flags))
        return false;
I actually put the is_cow_mapping() check in there because check_vma_flags() should make sure that we cannot possibly end up here in that case. But we can spell it out with comments, doesn't hurt.
but I'd actually go even further: in this case that "is_cow_mapping()" helper to some degree actually hides what is going on.
So I'd actually prefer for that function to be written something like
/* If the pte is writable, we can write to the page */
if (pte_write(pte))
        return true;

/* Maybe FOLL_FORCE is set to override it? */
if (!(flags & FOLL_FORCE))
        return false;

/* But FOLL_FORCE has no effect on shared mappings */
if (vma->vm_flags & MAP_SHARED)
        return false;

/* .. or read-only private ones */
if (!(vma->vm_flags & MAP_MAYWRITE))
        return false;

/* .. or already writable ones that just need to take a write fault */
if (vma->vm_flags & MAP_WRITE)
        return false;
and the two first vm_flags tests above are basically doing that "is_cow_mapping()", and maybe we could even have a comment to that effect, but wouldn't it be nice to just write it out that way?
And after you've written it out like the above, now that
if (!page || !PageAnon(page) || !PageAnonExclusive(page))
        return false;
makes you pretty safe from a data sharing perspective: it's most definitely not a shared page at that point.
So if you write it that way, the only remaining issues are the magic special soft-dirty and uffd ones, but at that point it's purely about the semantics of those features, no longer about any possible "oh, we fooled some shared page to be writable".
And I think the above is fairly legible without any subtle cases, and the one-liner comments make it fairly clear what it's testing.
Is any of this in any _technical_ way different from what your patch did? No. It's literally just rewriting it to be a bit more explicit in what it is doing, I think, and it makes that odd "it's not writable if VM_WRITE is set" case a bit more explicit.
Hmm?
No strong opinion. I'm happy as long as it's fixed, and the fix is robust.
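[For reference, a hedged sketch of how can_follow_write_pte() might look with Linus' restructuring combined with the patch's exclusive-anon and softdirty/uffd-wp handling, spelled out with the vm_flags names. This is an approximation, not necessarily the exact code that ended up merged.]

static inline bool can_follow_write_pte(pte_t pte, struct page *page,
					struct vm_area_struct *vma,
					unsigned int flags)
{
	/* If the pte is writable, we can write to the page. */
	if (pte_write(pte))
		return true;

	/* Maybe FOLL_FORCE is set to override it? */
	if (!(flags & FOLL_FORCE))
		return false;

	/* But FOLL_FORCE has no effect on shared mappings ... */
	if (vma->vm_flags & (VM_MAYSHARE | VM_SHARED))
		return false;

	/* ... or read-only private ones ... */
	if (!(vma->vm_flags & VM_MAYWRITE))
		return false;

	/* ... or already writable ones that just need to take a write fault. */
	if (vma->vm_flags & VM_WRITE)
		return false;

	/*
	 * Only an exclusive anonymous page means COW was properly broken;
	 * anything else has to go through a (fake) write fault first.
	 */
	if (!page || !PageAnon(page) || !PageAnonExclusive(page))
		return false;

	/* Softdirty and uffd-wp tracking still require taking a write fault. */
	if (vma_soft_dirty_enabled(vma) && !pte_soft_dirty(pte))
		return false;
	return !userfaultfd_pte_wp(vma, pte);
}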
On 09.08.22 22:00, Linus Torvalds wrote:
On Mon, Aug 8, 2022 at 12:32 AM David Hildenbrand david@redhat.com wrote:
So I've read through the patch several times, and it seems fine, but this function (and the pmd version of it) just read oddly to me.
+static inline bool can_follow_write_pte(pte_t pte, struct page *page,
struct vm_area_struct *vma,
unsigned int flags)
+{
if (pte_write(pte))
return true;
if (!(flags & FOLL_FORCE))
return false;
/*
* See check_vma_flags(): only COW mappings need that special
* "force" handling when they lack VM_WRITE.
*/
if (vma->vm_flags & VM_WRITE)
return false;
VM_BUG_ON(!is_cow_mapping(vma->vm_flags));
So apart from the VM_BUG_ON(), this code just looks really strange - even despite the comment. Just conceptually, the whole "if it's writable, return that you cannot follow it for a write" just looks so very very strange.
That doesn't make the code _wrong_, but considering how many times this has had subtle bugs, let's not write code that looks strange.
So I would suggest that to protect against future bugs, we try to make it be fairly clear and straightforward, and maybe even a bit overly protective.
For example, let's kill the "shared mapping that you don't have write permissions to" very explicitly and without any subtle code at all. The vm_flags tests are cheap and easy, and we could very easily just add some core ones to make any mistakes much less critical.
Now, making that 'is_cow_mapping()' check explicit at the very top of this would already go a long way:
/* FOLL_FORCE for writability only affects COW mappings */
if (!is_cow_mapping(vma->vm_flags))
        return false;
but I'd actually go even further: in this case that "is_cow_mapping()" helper to some degree actually hides what is going on.
So I'd actually prefer for that function to be written something like
/* If the pte is writable, we can write to the page */
if (pte_write(pte))
        return true;

/* Maybe FOLL_FORCE is set to override it? */
if (!(flags & FOLL_FORCE))
        return false;

/* But FOLL_FORCE has no effect on shared mappings */
if (vma->vm_flags & MAP_SHARED)
        return false;
I'd actually rather check for VM_MAYSHARE here, which is even stronger. Thoughts?
On Tue, Aug 9, 2022 at 1:07 PM David Hildenbrand david@redhat.com wrote:
/* But FOLL_FORCE has no effect on shared mappings */
if (vma->vm_flags & MAP_SHARED)
        return false;
I'd actually rather check for VM_MAYSHARE here, which is even stronger. Thoughts?
Hmm. Adding the test for both is basically free (all those vm_flags checks end up being a bit mask and compare), so no objections.
For some reason I thought VM_SHARED and VM_MAYSHARE always end up being the same bits, and that it was a mistake to make them two bits in the first place (unlike the read/write/exec bits that are about mprotect),
But as there are two bits, I'm sure somebody ends up touching one and not the other.
So yeah, I don't see any downside to just checking both bits.
[ That said, is_cow_mapping() only checks VM_SHARED, so if they are ever different, that's a potential source of confusion ]
Linus
On 09.08.22 22:14, Linus Torvalds wrote:
On Tue, Aug 9, 2022 at 1:07 PM David Hildenbrand david@redhat.com wrote:
/* But FOLL_FORCE has no effect on shared mappings */
if (vma->vm_flags & MAP_SHARED)
        return false;
I'd actually rather check for VM_MAYSHARE here, which is even stronger. Thoughts?
Hmm. Adding the test for both is basically free (all those vm_flags checks end up being a bit mask and compare), so no objections.
For some reason I thought VM_SHARED and VM_MAYSHARE always end up being the same bits, and that it was a mistake to make them two bits in the first place (unlike the read/write/exec bits that are about mprotect),
But as there are two bits, I'm sure somebody ends up touching one and not the other.
So yeah, I don't see any downside to just checking both bits.
[ That said, is_cow_mapping() only checks VM_SHARED, so if they are ever different, that's a potential source of confusion ]
IIUC VM_MAYSHARE is always set in a MAP_SHARED mapping, but for file mappings we only set VM_SHARED if the file allows for writes (and we can set VM_MAYWRITE or even VM_WRITE). [don't ask me why, it's a steady source of confusion]
That's why is_cow_mapping() works even due to the weird MAP_SHARED vs. VM_MAYSHARE logic.
I'd actually only check for VM_MAYSHARE here, which implies MAP_SHARED. If someone would ever mess that up, I guess we'd be in bigger trouble.
But whatever you prefer.
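[For context, the helper being discussed boils down to the following check (as found in include/linux/mm.h around the time of this discussion), which is why clearing VM_SHARED for read-only file mappings keeps it working:]

static inline bool is_cow_mapping(vm_flags_t flags)
{
	/* Private (no VM_SHARED) and writable-in-principle => COW mapping. */
	return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}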
On Tue, Aug 9, 2022 at 1:20 PM David Hildenbrand david@redhat.com wrote:
IIUC VM_MAYSHARE is always set in a MAP_SHARED mapping, but for file mappings we only set VM_SHARED if the file allows for writes
Heh.
This is a horrific hack, and probably should go away.
Yeah, we have that
if (!(file->f_mode & FMODE_WRITE))
        vm_flags &= ~(VM_MAYWRITE | VM_SHARED);
but I think that's _entirely_ historical.
Long long ago, in a galaxy far away, we didn't handle shared mmap() very well. In fact, we used to not handle it at all.
But nntpd would use write() to update the spool file, and then read it through a shared mmap.
And since our mmap() *was* coherent with people doing write() system calls, but didn't handle actual dirty shared mmap, what Linux used to do was to just say "Oh, you want a read-only shared file mmap? I can do that - I'll just downgrade it to a read-only _private_ mapping, and it actually ends up with the same semantics".
And here we are, 30 years later, and it still does that, but it leaves the VM_MAYSHARE flag so that /proc/<pid>/maps can show that it's a shared mapping.
Linus
On Tue, Aug 9, 2022 at 1:30 PM Linus Torvalds torvalds@linux-foundation.org wrote:
And here we are, 30 years later, and it still does that, but it leaves the VM_MAYSHARE flag so that /proc/<pid>/maps can show that it's a shared mapping.
.. thinking about it, we end up still having some things that this helps.
For example, because we clear the VM_SHARED flags for read-only shared mappings, they don't end up going through mapping_{un}map_writable(), and don't update i_mmap_writable, and don't cause issues with mapping_deny_writable() or mapping_writably_mapped().
So it ends up actually having random small semantic details due to those almost three decades of history.
I'm sure there are other odd pieces like that.
Linus
On 09.08.22 22:30, Linus Torvalds wrote:
On Tue, Aug 9, 2022 at 1:20 PM David Hildenbrand david@redhat.com wrote:
IIUC VM_MAYSHARE is always set in a MAP_SHARED mapping, but for file mappings we only set VM_SHARED if the file allows for writes
Heh.
This is a horrific hack, and probably should go away.
Yeah, we have that
if (!(file->f_mode & FMODE_WRITE))
        vm_flags &= ~(VM_MAYWRITE | VM_SHARED);
but I think that's _entirely_ historical.
Long long ago, in a galaxy far away, we didn't handle shared mmap() very well. In fact, we used to not handle it at all.
But nntpd would use write() to update the spool file, and then read it through a shared mmap.
And since our mmap() *was* coherent with people doing write() system calls, but didn't handle actual dirty shared mmap, what Linux used to do was to just say "Oh, you want a read-only shared file mmap? I can do that - I'll just downgrade it to a read-only _private_ mapping, and it actually ends up with the same semantics".
And here we are, 30 years later, and it still does that, but it leaves the VM_MAYSHARE flag so that /proc/<pid>/maps can show that it's a shared mapping.
I was suspecting that this code is full of legacy :)
What would make sense to me is to just have VM_SHARED and make it correspond to MAP_SHARED; that would at least confuse me less. Once I have some spare cycles I'll see how easy that might be to achieve.
On Tue, Aug 9, 2022 at 1:14 PM Linus Torvalds torvalds@linux-foundation.org wrote:
But as there are two bits, I'm sure somebody ends up touching one and not the other.
Ugh. The nommu code certainly does odd things here. That just looks wrong.
And the hugetlb code does something horrible too, but never *sets* it, so it looks like some odd subset of VM_SHARED.
Linus
On 09.08.22 22:20, Linus Torvalds wrote:
On Tue, Aug 9, 2022 at 1:14 PM Linus Torvalds torvalds@linux-foundation.org wrote:
But as there are two bits, I'm sure somebody ends up touching one and not the other.
Ugh. The nommu code certainly does odd things here. That just looks wrong.
And the hugetlb code does something horrible too, but never *sets* it, so it looks like some odd subset of VM_SHARED.
mm/mmap.c:do_mmap() contains the magic bit
switch (flags & MAP_TYPE) {
case MAP_SHARED:
        ...
case MAP_SHARED_VALIDATE:
        ...
        vm_flags |= VM_SHARED | VM_MAYSHARE;
        if (!(file->f_mode & FMODE_WRITE))
                vm_flags &= ~(VM_MAYWRITE | VM_SHARED);
        fallthrough;
VM_SHARED semantics are confusing.
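[A hedged userspace illustration of the downgrade discussed above: a MAP_SHARED mapping of a file opened O_RDONLY still shows up as shared ('s') in /proc/self/maps, since that character reflects VM_MAYSHARE, even though VM_SHARED was cleared inside the kernel. The file path is arbitrary and error handling is omitted.]

#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        int fd = open("/etc/hostname", O_RDONLY);   /* any readable file */
        void *p = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);

        /* Expect a line like "... r--s ... /etc/hostname" for this mapping. */
        system("grep hostname /proc/self/maps");

        munmap(p, 4096);
        close(fd);
        return 0;
}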