While patching the .BTF_ids section in vmlinux, resolve_btfids writes type
ids using host-native endianness, and relies on libelf for any required
translation when finally updating vmlinux. However, the default type of the
.BTF_ids section content is ELF_T_BYTE (i.e. unsigned char), and undergoes
no translation. This results in incorrect patched values if cross-compiling
to non-native endianness, and can manifest as kernel Oops and test failures
which are difficult to debug.
Explicitly set the type of patched data to ELF_T_WORD, allowing libelf to
transparently handle the endian conversions.
Fixes: fbbb68de80a4 ("bpf: Add resolve_btfids tool to resolve BTF IDs in ELF object")
Cc: stable(a)vger.kernel.org # v5.10+
Cc: Jiri Olsa <jolsa(a)kernel.org>
Cc: Yonghong Song <yhs(a)fb.com>
Link: https://lore.kernel.org/bpf/CAPGftE_eY-Zdi3wBcgDfkz_iOr1KF10n=9mJHm1_a_Pykc…
Signed-off-by: Tony Ambardar <Tony.Ambardar(a)gmail.com>
---
tools/bpf/resolve_btfids/main.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/tools/bpf/resolve_btfids/main.c b/tools/bpf/resolve_btfids/main.c
index d636643ddd35..f32c059fbfb4 100644
--- a/tools/bpf/resolve_btfids/main.c
+++ b/tools/bpf/resolve_btfids/main.c
@@ -649,6 +649,9 @@ static int symbols_patch(struct object *obj)
if (sets_patch(obj))
return -1;
+ /* Set type to ensure endian translation occurs. */
+ obj->efile.idlist->d_type = ELF_T_WORD;
+
elf_flagdata(obj->efile.idlist, ELF_C_SET, ELF_F_DIRTY);
err = elf_update(obj->efile.elf, ELF_C_WRITE);
--
2.25.1
try_grab_compound_head() is used to grab a reference to a page from
get_user_pages_fast(), which is only protected against concurrent
freeing of page tables (via local_irq_save()), but not against
concurrent TLB flushes, freeing of data pages, or splitting of compound
pages.
Because no reference is held to the page when try_grab_compound_head()
is called, the page may have been freed and reallocated by the time its
refcount has been elevated; therefore, once we're holding a stable
reference to the page, the caller re-checks whether the PTE still points
to the same page (with the same access rights).
The problem is that try_grab_compound_head() has to grab a reference on
the head page; but between the time we look up what the head page is and
the time we actually grab a reference on the head page, the compound
page may have been split up (either explicitly through split_huge_page()
or by freeing the compound page to the buddy allocator and then
allocating its individual order-0 pages).
If that happens, get_user_pages_fast() may end up returning the right
page but lifting the refcount on a now-unrelated page, leading to
use-after-free of pages.
To fix it:
Re-check whether the pages still belong together after lifting the
refcount on the head page.
Move anything else that checks compound_head(page) below the refcount
increment.
This can't actually happen on bare-metal x86 (because there, disabling
IRQs locks out remote TLB flushes), but it can happen on virtualized x86
(e.g. under KVM) and probably also on arm64. The race window is pretty
narrow, and constantly allocating and shattering hugepages isn't exactly
fast; for now I've only managed to reproduce this in an x86 KVM guest with
an artificially widened timing window (by adding a loop that repeatedly
calls `inl(0x3f8 + 5)` in `try_get_compound_head()` to force VM exits,
so that PV TLB flushes are used instead of IPIs).
As requested on the list, also replace the existing VM_BUG_ON_PAGE()
with a warning and bailout. Since the existing code only performed the
BUG_ON check on DEBUG_VM kernels, ensure that the new code also only
performs the check under that configuration - I don't want to mix two
logically separate changes together too much.
The macro VM_WARN_ON_ONCE_PAGE() doesn't return a value on !DEBUG_VM,
so wrap the whole check in an #ifdef block.
An alternative would be to change the VM_WARN_ON_ONCE_PAGE() definition
for !DEBUG_VM such that it always returns false, but since that would
differ from the behavior of the normal WARN macros, it might be too
confusing for readers.
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Kirill A. Shutemov <kirill(a)shutemov.name>
Cc: John Hubbard <jhubbard(a)nvidia.com>
Cc: Jan Kara <jack(a)suse.cz>
Cc: stable(a)vger.kernel.org
Fixes: 7aef4172c795 ("mm: handle PTE-mapped tail pages in gerneric fast gup implementaiton")
Signed-off-by: Jann Horn <jannh(a)google.com>
---
mm/gup.c | 58 +++++++++++++++++++++++++++++++++++++++++---------------
1 file changed, 43 insertions(+), 15 deletions(-)
diff --git a/mm/gup.c b/mm/gup.c
index 3ded6a5f26b2..90262e448552 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -43,8 +43,25 @@ static void hpage_pincount_sub(struct page *page, int refs)
atomic_sub(refs, compound_pincount_ptr(page));
}
+/* Equivalent to calling put_page() @refs times. */
+static void put_page_refs(struct page *page, int refs)
+{
+#ifdef CONFIG_DEBUG_VM
+ if (VM_WARN_ON_ONCE_PAGE(page_ref_count(page) < refs, page))
+ return;
+#endif
+
+ /*
+ * Calling put_page() for each ref is unnecessarily slow. Only the last
+ * ref needs a put_page().
+ */
+ if (refs > 1)
+ page_ref_sub(page, refs - 1);
+ put_page(page);
+}
+
/*
* Return the compound head page with ref appropriately incremented,
* or NULL if that failed.
*/
@@ -55,8 +72,23 @@ static inline struct page *try_get_compound_head(struct page *page, int refs)
if (WARN_ON_ONCE(page_ref_count(head) < 0))
return NULL;
if (unlikely(!page_cache_add_speculative(head, refs)))
return NULL;
+
+ /*
+ * At this point we have a stable reference to the head page; but it
+ * could be that between the compound_head() lookup and the refcount
+ * increment, the compound page was split, in which case we'd end up
+ * holding a reference on a page that has nothing to do with the page
+ * we were given anymore.
+ * So now that the head page is stable, recheck that the pages still
+ * belong together.
+ */
+ if (unlikely(compound_head(page) != head)) {
+ put_page_refs(head, refs);
+ return NULL;
+ }
+
return head;
}
/*
@@ -94,25 +126,28 @@ __maybe_unused struct page *try_grab_compound_head(struct page *page,
if (unlikely((flags & FOLL_LONGTERM) &&
!is_pinnable_page(page)))
return NULL;
+ /*
+ * CAUTION: Don't use compound_head() on the page before this
+ * point, the result won't be stable.
+ */
+ page = try_get_compound_head(page, refs);
+ if (!page)
+ return NULL;
+
/*
* When pinning a compound page of order > 1 (which is what
* hpage_pincount_available() checks for), use an exact count to
* track it, via hpage_pincount_add/_sub().
*
* However, be sure to *also* increment the normal page refcount
* field at least once, so that the page really is pinned.
*/
- if (!hpage_pincount_available(page))
- refs *= GUP_PIN_COUNTING_BIAS;
-
- page = try_get_compound_head(page, refs);
- if (!page)
- return NULL;
-
if (hpage_pincount_available(page))
hpage_pincount_add(page, refs);
+ else
+ page_ref_add(page, refs * (GUP_PIN_COUNTING_BIAS - 1));
mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_ACQUIRED,
orig_refs);
@@ -134,16 +169,9 @@ static void put_compound_head(struct page *page, int refs, unsigned int flags)
else
refs *= GUP_PIN_COUNTING_BIAS;
}
- VM_BUG_ON_PAGE(page_ref_count(page) < refs, page);
- /*
- * Calling put_page() for each ref is unnecessarily slow. Only the last
- * ref needs a put_page().
- */
- if (refs > 1)
- page_ref_sub(page, refs - 1);
- put_page(page);
+ put_page_refs(page, refs);
}
/**
* try_grab_page() - elevate a page's refcount by a flag-dependent amount
base-commit: 614124bea77e452aa6df7a8714e8bc820b489922
--
2.32.0.272.g935e593368-goog
From: "Steven Rostedt (VMware)" <rostedt(a)goodmis.org>
The trace_clock_global() tries to make sure the events between CPUs is
somewhat in order. A global value is used and updated by the latest read
of a clock. If one CPU is ahead by a little, and is read by another CPU, a
lock is taken, and if the timestamp of the other CPU is behind, it will
simply use the other CPUs timestamp.
The lock is also only taken with a "trylock" due to tracing, and strange
recursions can happen. The lock is not taken at all in NMI context.
In the case where the lock is not able to be taken, the non synced
timestamp is returned. But it will not be less than the saved global
timestamp.
The problem arises because when the time goes "backwards" the time
returned is the saved timestamp plus 1. If the lock is not taken, and the
plus one to the timestamp is returned, there's a small race that can cause
the time to go backwards!
CPU0 CPU1
---- ----
trace_clock_global() {
ts = clock() [ 1000 ]
trylock(clock_lock) [ success ]
global_ts = ts; [ 1000 ]
<interrupted by NMI>
trace_clock_global() {
ts = clock() [ 999 ]
if (ts < global_ts)
ts = global_ts + 1 [ 1001 ]
trylock(clock_lock) [ fail ]
return ts [ 1001]
}
unlock(clock_lock);
return ts; [ 1000 ]
}
trace_clock_global() {
ts = clock() [ 1000 ]
if (ts < global_ts) [ false 1000 == 1000 ]
trylock(clock_lock) [ success ]
global_ts = ts; [ 1000 ]
unlock(clock_lock)
return ts; [ 1000 ]
}
The above case shows to reads of trace_clock_global() on the same CPU, but
the second read returns one less than the first read. That is, time when
backwards, and this is not what is allowed by trace_clock_global().
This was triggered by heavy tracing and the ring buffer checker that tests
for the clock going backwards:
Ring buffer clock went backwards: 20613921464 -> 20613921463
------------[ cut here ]------------
WARNING: CPU: 2 PID: 0 at kernel/trace/ring_buffer.c:3412 check_buffer+0x1b9/0x1c0
Modules linked in:
[..]
[CPU: 2]TIME DOES NOT MATCH expected:20620711698 actual:20620711697 delta:6790234 before:20613921463 after:20613921463
[20613915818] PAGE TIME STAMP
[20613915818] delta:0
[20613915819] delta:1
[20613916035] delta:216
[20613916465] delta:430
[20613916575] delta:110
[20613916749] delta:174
[20613917248] delta:499
[20613917333] delta:85
[20613917775] delta:442
[20613917921] delta:146
[20613918321] delta:400
[20613918568] delta:247
[20613918768] delta:200
[20613919306] delta:538
[20613919353] delta:47
[20613919980] delta:627
[20613920296] delta:316
[20613920571] delta:275
[20613920862] delta:291
[20613921152] delta:290
[20613921464] delta:312
[20613921464] delta:0 TIME EXTEND
[20613921464] delta:0
This happened more than once, and always for an off by one result. It also
started happening after commit aafe104aa9096 was added.
Cc: stable(a)vger.kernel.org
Fixes: aafe104aa9096 ("tracing: Restructure trace_clock_global() to never block")
Signed-off-by: Steven Rostedt (VMware) <rostedt(a)goodmis.org>
---
kernel/trace/trace_clock.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/kernel/trace/trace_clock.c b/kernel/trace/trace_clock.c
index c1637f90c8a3..4702efb00ff2 100644
--- a/kernel/trace/trace_clock.c
+++ b/kernel/trace/trace_clock.c
@@ -115,9 +115,9 @@ u64 notrace trace_clock_global(void)
prev_time = READ_ONCE(trace_clock_struct.prev_time);
now = sched_clock_cpu(this_cpu);
- /* Make sure that now is always greater than prev_time */
+ /* Make sure that now is always greater than or equal to prev_time */
if ((s64)(now - prev_time) < 0)
- now = prev_time + 1;
+ now = prev_time;
/*
* If in an NMI context then dont risk lockups and simply return
@@ -131,7 +131,7 @@ u64 notrace trace_clock_global(void)
/* Reread prev_time in case it was already updated */
prev_time = READ_ONCE(trace_clock_struct.prev_time);
if ((s64)(now - prev_time) < 0)
- now = prev_time + 1;
+ now = prev_time;
trace_clock_struct.prev_time = now;
--
2.30.2
From: "Steven Rostedt (VMware)" <rostedt(a)goodmis.org>
A while ago, when the "trace" file was opened, tracing was stopped, and
code was added to stop recording the comms to saved_cmdlines, for mapping
of the pids to the task name.
Code has been added that only records the comm if a trace event occurred,
and there's no reason to not trace it if the trace file is opened.
Cc: stable(a)vger.kernel.org
Fixes: 7ffbd48d5cab2 ("tracing: Cache comms only after an event occurred")
Signed-off-by: Steven Rostedt (VMware) <rostedt(a)goodmis.org>
---
kernel/trace/trace.c | 9 ---------
1 file changed, 9 deletions(-)
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index e220b37e29c6..d23a09d3eb37 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -2198,9 +2198,6 @@ struct saved_cmdlines_buffer {
};
static struct saved_cmdlines_buffer *savedcmd;
-/* temporary disable recording */
-static atomic_t trace_record_taskinfo_disabled __read_mostly;
-
static inline char *get_saved_cmdlines(int idx)
{
return &savedcmd->saved_cmdlines[idx * TASK_COMM_LEN];
@@ -3996,9 +3993,6 @@ static void *s_start(struct seq_file *m, loff_t *pos)
return ERR_PTR(-EBUSY);
#endif
- if (!iter->snapshot)
- atomic_inc(&trace_record_taskinfo_disabled);
-
if (*pos != iter->pos) {
iter->ent = NULL;
iter->cpu = 0;
@@ -4041,9 +4035,6 @@ static void s_stop(struct seq_file *m, void *p)
return;
#endif
- if (!iter->snapshot)
- atomic_dec(&trace_record_taskinfo_disabled);
-
trace_access_unlock(iter->cpu_file);
trace_event_read_unlock();
}
--
2.30.2
From: "Steven Rostedt (VMware)" <rostedt(a)goodmis.org>
The saved_cmdlines is used to map pids to the task name, such that the
output of the tracing does not just show pids, but also gives a human
readable name for the task.
If the name is not mapped, the output looks like this:
<...>-1316 [005] ...2 132.044039: ...
Instead of this:
gnome-shell-1316 [005] ...2 132.044039: ...
The names are updated when tracing is running, but are skipped if tracing
is stopped. Unfortunately, this stops the recording of the names if the
top level tracer is stopped, and not if there's other tracers active.
The recording of a name only happens when a new event is written into a
ring buffer, so there is no need to test if tracing is on or not. If
tracing is off, then no event is written and no need to test if tracing is
off or not.
Remove the check, as it hides the names of tasks for events in the
instance buffers.
Cc: stable(a)vger.kernel.org
Fixes: 7ffbd48d5cab2 ("tracing: Cache comms only after an event occurred")
Signed-off-by: Steven Rostedt (VMware) <rostedt(a)goodmis.org>
---
kernel/trace/trace.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 9299057feb56..e220b37e29c6 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -2486,8 +2486,6 @@ static bool tracing_record_taskinfo_skip(int flags)
{
if (unlikely(!(flags & (TRACE_RECORD_CMDLINE | TRACE_RECORD_TGID))))
return true;
- if (atomic_read(&trace_record_taskinfo_disabled) || !tracing_is_on())
- return true;
if (!__this_cpu_read(trace_taskinfo_save))
return true;
return false;
--
2.30.2