From: Steven Rostedt rostedt@goodmis.org
The persistent ring buffer can live across boots. It is expected that the content in the buffer can be translated to the current kernel with delta offsets even with KASLR enabled. But it can only guarantee this if the content of the ring buffer came from the same kernel as the one that is currently running.
Add uname into the meta data and if the uname in the meta data from the previous boot does not match the uname of the current boot, then clear the buffer and re-initialize it.
This only handles the case of kernel versions. It does not clear the buffer for development. There's several mechanisms to keep bad data from crashing the kernel. The worse that can happen is some corrupt data may be displayed.
Cc: stable@vger.kernel.org Fixes: 8f3e6659656e6 ("ring-buffer: Save text and data locations in mapped meta data") Signed-off-by: Steven Rostedt (Google) rostedt@goodmis.org --- kernel/trace/ring_buffer.c | 10 ++++++++++ 1 file changed, 10 insertions(+)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 7e257e855dd1..3c94c59d000c 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -17,6 +17,7 @@ #include <linux/uaccess.h> #include <linux/hardirq.h> #include <linux/kthread.h> /* for self test */ +#include <linux/utsname.h> #include <linux/module.h> #include <linux/percpu.h> #include <linux/mutex.h> @@ -45,10 +46,13 @@ static void update_pages_handler(struct work_struct *work);
#define RING_BUFFER_META_MAGIC 0xBADFEED +#define UNAME_SZ 64
struct ring_buffer_meta { int magic; int struct_size; + char uname[UNAME_SZ]; + unsigned long text_addr; unsigned long data_addr; unsigned long first_buffer; @@ -1687,6 +1691,11 @@ static bool rb_meta_valid(struct ring_buffer_meta *meta, int cpu, return false; }
+ if (strncmp(init_utsname()->release, meta->uname, UNAME_SZ - 1)) { + pr_info("Ring buffer boot meta[%d] mismatch of uname\n", cpu); + return false; + } + /* The subbuffer's size and number of subbuffers must match */ if (meta->subbuf_size != subbuf_size || meta->nr_subbufs != nr_pages + 1) { @@ -1920,6 +1929,7 @@ static void rb_range_meta_init(struct trace_buffer *buffer, int nr_pages)
meta->magic = RING_BUFFER_META_MAGIC; meta->struct_size = sizeof(*meta); + strscpy(meta->uname, init_utsname()->release, UNAME_SZ);
meta->nr_subbufs = nr_pages + 1; meta->subbuf_size = PAGE_SIZE;
On Tue, 17 Dec 2024 at 09:34, Steven Rostedt rostedt@goodmis.org wrote:
Add uname into the meta data and if the uname in the meta data from the previous boot does not match the uname of the current boot, then clear the buffer and re-initialize it.
This seems broken.
The problem is not that the previous boot data is wrong.
The problem is that you printed it *out* wrong by trying to interpret pointers in it.
Now you basically hide that, and make it harder to see any data from a bad kernel (since you presumably need to boot into a good kernel to do analysis).
The real fix seems to have been your 3/3, which still prints out the data, but stops trying to interpret the pointers in it.
Except you should also remove the last_text_delta / last_data_delta stuff. That's all about exactly that "trying to interpret bogus pointers". Instead you seem to have actually just *added* a case of that in your 3/3.
Linus
On Tue, 17 Dec 2024 09:46:30 -0800 Linus Torvalds torvalds@linux-foundation.org wrote:
On Tue, 17 Dec 2024 at 09:34, Steven Rostedt rostedt@goodmis.org wrote:
Add uname into the meta data and if the uname in the meta data from the previous boot does not match the uname of the current boot, then clear the buffer and re-initialize it.
This seems broken.
BTW, this isn't the fix for the trace_check_vprintf(). That is here:
https://lore.kernel.org/linux-trace-kernel/20241217024118.587584221@goodmis....
Where the I completely remove that function.
The problem is not that the previous boot data is wrong.
The problem is that you printed it *out* wrong by trying to interpret pointers in it.
Now you basically hide that, and make it harder to see any data from a bad kernel (since you presumably need to boot into a good kernel to do analysis).
The real fix seems to have been your 3/3, which still prints out the data, but stops trying to interpret the pointers in it.
Except you should also remove the last_text_delta / last_data_delta stuff. That's all about exactly that "trying to interpret bogus pointers". Instead you seem to have actually just *added* a case of that in your 3/3.
I'm not sure what you mean. If the kernels are the same, then the pointers should also be the same, as KASLR just shifts them. This no longer uses module code. It only traces core kernel code which should always have the same offsets.
With the last_text_delta (and last_data_delta is always the same, but I added it in the case that ever changes), the shifts have always been accurate. For example, for RCU event strings.
With the pointer update to print_fields() I get:
<...>-912 [001] d..1. 304.817710: rcu_utilization: s=(0xffffffff9c4026b5:Start context switch) <...>-912 [001] d..1. 304.817713: rcu_utilization: s=(0xffffffff9c4ca779:End context switch) <idle>-0 [002] dN.1. 304.817769: rcu_utilization: s=(0xffffffff9c4026b5:Start context switch) <idle>-0 [002] dN.1. 304.817772: rcu_utilization: s=(0xffffffff9c4ca779:End context switch) <...>-19 [002] d..1. 304.817805: rcu_utilization: s=(0xffffffff9c4026b5:Start context switch) <...>-19 [002] d..1. 304.817806: rcu_utilization: s=(0xffffffff9c4ca779:End context switch) <idle>-0 [006] dN.1. 304.817819: rcu_utilization: s=(0xffffffff9c4026b5:Start context switch) <idle>-0 [006] dN.1. 304.817821: rcu_utilization: s=(0xffffffff9c4ca779:End context switch) <...>-902 [006] d.h1. 304.817901: rcu_utilization: s=(0xffffffff9c488fdd:Start scheduler-tick) <...>-902 [006] d.h1. 304.817903: rcu_utilization: s=(0xffffffff9c426d45:End scheduler-tick) <idle>-0 [007] dN.1. 304.817913: rcu_utilization: s=(0xffffffff9c4026b5:Start context switch) <idle>-0 [007] dN.1. 304.817915: rcu_utilization: s=(0xffffffff9c4ca779:End context switch) <...>-18 [007] d..1. 304.817931: rcu_utilization: s=(0xffffffff9c4026b5:Start context switch) <...>-18 [007] d..1. 304.817931: rcu_utilization: s=(0xffffffff9c4ca779:End context switch) <...>-902 [006] ..s1. 304.817941: rcu_utilization: s=(0xffffffff9c470f20:Start RCU core) <...>-902 [006] ..s1. 304.817958: rcu_utilization: s=(0xffffffff9c419bb8:End RCU core) <idle>-0 [002] dN.1. 304.818003: rcu_utilization: s=(0xffffffff9c4026b5:Start context switch) <idle>-0 [002] dN.1. 304.818003: rcu_utilization: s=(0xffffffff9c4ca779:End context switch)
Without that calculation, all I get is garbage and completely useless:
<...>-903 [000] d..1. 24.712788: rcu_utilization: s=(0xffffffff970026af) <...>-903 [000] d..1. 24.712791: rcu_utilization: s=(0xffffffff970ca75b) <idle>-0 [004] dN.1. 24.712796: rcu_utilization: s=(0xffffffff970026af) <idle>-0 [004] dN.1. 24.712799: rcu_utilization: s=(0xffffffff970ca75b) <...>-19 [004] d..1. 24.712825: rcu_utilization: s=(0xffffffff970026af) <...>-19 [004] d..1. 24.712825: rcu_utilization: s=(0xffffffff970ca75b) <...>-18 [004] d..1. 24.712838: rcu_utilization: s=(0xffffffff970026af) <...>-18 [004] d..1. 24.712839: rcu_utilization: s=(0xffffffff970ca75b) <idle>-0 [005] dN.1. 24.712881: rcu_utilization: s=(0xffffffff970026af) <idle>-0 [005] dN.1. 24.712883: rcu_utilization: s=(0xffffffff970ca75b) <...>-893 [005] d.h1. 24.712911: rcu_utilization: s=(0xffffffff97088feb) <...>-893 [005] d.h1. 24.712912: rcu_utilization: s=(0xffffffff97026d3f) <...>-893 [005] ..s1. 24.712942: rcu_utilization: s=(0xffffffff97070f2e) <...>-893 [005] ..s1. 24.712944: rcu_utilization: s=(0xffffffff97019bb2) <...>-893 [005] dN.1. 24.713412: rcu_utilization: s=(0xffffffff970026af) <...>-893 [005] dN.1. 24.713412: rcu_utilization: s=(0xffffffff970ca75b)
The delta calculations are done by saving an address of a symbol into the meta data, and when the meta data is considered a match, it calculates the delta between what was saved in the meta data to the same symbol in the current kernel.
This works across several reboots too. That is, I can save the boot mapped data, reboot several times, and still see the correct data in the ring buffer, as the delta is calculated at each boot. It's only saved when tracing starts and the ring buffer is re-initialized.
So what exactly are you saying is broken?
-- Steve
On Tue, 17 Dec 2024 at 10:04, Steven Rostedt rostedt@goodmis.org wrote:
I'm not sure what you mean. If the kernels are the same, then the pointers should also be the same, as KASLR just shifts them. This no longer uses module code. It only traces core kernel code which should always have the same offsets.
(a) the shifting is what caused problems with you having to hack round the format string and %pS.
(b) the data addresses are more than shifted, so that "data_delta" is *completely* bogus
(c) neither of them are AT ALL valid for modules regardless
Stop using the delta fields. They are broken. Even for the same kernel. It's literally a "I made sh*t up and it sometimes works" situation.
That "sometimes works" is not how we do kernel development. Stop it.
What *woiuld* have been an acceptable model is to actually modify the boot-time buffers in place, using actual real heuristics that look at whether a pointer was IN THE CODE SECTION OR THE STATIC DATA section of the previous boot.
But you never did that. All this delta code has always been complete and utter garbage, and complete hacks.
Remove it.
Then, if at some point you can do it *right* (see above), maybe you can try to re-introduce it. But the current delta code is pure and utter garbage and needs to die. No more of this "hacking shit up to make it sometimes work".
Linus
On Tue, 17 Dec 2024 at 10:19, Linus Torvalds torvalds@linux-foundation.org wrote:
What *woiuld* have been an acceptable model is to actually modify the boot-time buffers in place, using actual real heuristics that look at whether a pointer was IN THE CODE SECTION OR THE STATIC DATA section of the previous boot.
But you never did that. All this delta code has always been complete and utter garbage, and complete hacks.
Actually, I think the proper model isn't even that "modify boot time buffers in place" thing.
The proper model was probably always to just do the "give the raw data, and analyze the previous boot data in user mode". Don't do "delta between old and new kernel", print out the actual KASLR base of the old kernel, and let user mode figure it out. Because user mode may actually have the old and different vmlinux image too - something that kernel mode *never* has.
Linus
On Tue, 17 Dec 2024 10:24:42 -0800 Linus Torvalds torvalds@linux-foundation.org wrote:
On Tue, 17 Dec 2024 at 10:19, Linus Torvalds torvalds@linux-foundation.org wrote:
What *woiuld* have been an acceptable model is to actually modify the boot-time buffers in place, using actual real heuristics that look at whether a pointer was IN THE CODE SECTION OR THE STATIC DATA section of the previous boot.
But you never did that. All this delta code has always been complete and utter garbage, and complete hacks.
Actually, I think the proper model isn't even that "modify boot time buffers in place" thing.
The proper model was probably always to just do the "give the raw data, and analyze the previous boot data in user mode". Don't do "delta between old and new kernel", print out the actual KASLR base of the old kernel, and let user mode figure it out. Because user mode may actually have the old and different vmlinux image too - something that kernel mode *never* has.
I already support this somewhat, as the text_delta (and data_delta) are presented in the tracefs directory so that trace-cmd can parse it.
For my use case, this would work, as we are extracting the raw data and need to do most the processing in user space anyway. I could have it export the KASLR offset of the previous boot and then trace-cmd should be able to handle it fine if the events and kallsyms of the previous boot are all saved. It could easily put things back together, including modules and dynamic events.
This will just make it useless for those that want to use the tracefs directly.
-- Steve
On 2024-12-17 13:24, Linus Torvalds wrote:
On Tue, 17 Dec 2024 at 10:19, Linus Torvalds torvalds@linux-foundation.org wrote:
What *woiuld* have been an acceptable model is to actually modify the boot-time buffers in place, using actual real heuristics that look at whether a pointer was IN THE CODE SECTION OR THE STATIC DATA section of the previous boot.
But you never did that. All this delta code has always been complete and utter garbage, and complete hacks.
Actually, I think the proper model isn't even that "modify boot time buffers in place" thing.
The proper model was probably always to just do the "give the raw data, and analyze the previous boot data in user mode".
It appears that you just summarized the LTTng (out-of-tree) kernel tracer [1] model in one short sentence.
If this can help in some way, within the LTTng model, here is how we're solving the problem of mapping addresses to symbols:
1- We have a statedump infrastructure, which dumps internal kernel state. It could dump the kernel and each module base addresses into the trace. (we do it for userspace tracing)
2- We can hook on module load/unload to insert event about insertion and removal of those base addresses into the trace buffers. (we do it for userspace tracing)
3- We augment the traces at post-processing with DWARF and ELF parsers in Babeltrace [2] to augment the trace with symbolic information using the ELF or DWARF files as inputs in addition to the traces. (this already exists, and is used for userspace traces)
4- We already have the integration of the LTTng Userspace tracer with PMEM and DAX to recover traces after a machine crash. Those buffers are self-described with an ABI which allows a userspace tool (lttng-crash) to extract well-formed Common Trace Format [3] traces from the buffers after reboot. We've never had the incentive to port this facility to the kernel tracer so far though.
Thanks,
Mathieu
[1] https://lttng.org [2] https://babeltrace.org [3] https://diamon.org/ctf
On Tue, 17 Dec 2024 10:19:45 -0800 Linus Torvalds torvalds@linux-foundation.org wrote:
On Tue, 17 Dec 2024 at 10:04, Steven Rostedt rostedt@goodmis.org wrote:
I'm not sure what you mean. If the kernels are the same, then the pointers should also be the same, as KASLR just shifts them. This no longer uses module code. It only traces core kernel code which should always have the same offsets.
(a) the shifting is what caused problems with you having to hack round the format string and %pS.
How else do I get the function name?
(b) the data addresses are more than shifted, so that "data_delta" is *completely* bogus
The data_delta and text_delta are equal, and I could get rid of delta_data, as I haven't used it. I only used the text_delta as they are always the same.
(c) neither of them are AT ALL valid for modules regardless
I can make sure that it only works for core kernel code, and print the raw address if it isn't.
Stop using the delta fields. They are broken. Even for the same kernel. It's literally a "I made sh*t up and it sometimes works" situation.
That "sometimes works" is not how we do kernel development. Stop it.
For core kernel code it *always works*. I haven't seen it fail yet.
This is the point of this patch series, is to remove the cases where it can fail. That is, if the kernel isn't the same, or the use of modules and dynamic events that are not stable across reboots.
But for the core kernel code, I have not seen it fail once!
What *woiuld* have been an acceptable model is to actually modify the boot-time buffers in place, using actual real heuristics that look at whether a pointer was IN THE CODE SECTION OR THE STATIC DATA section of the previous boot.
So basically, you want a full parser of the trace event code that reads the boot time buffer and makes it match the current kernel?
But you never did that. All this delta code has always been complete and utter garbage, and complete hacks.
Remove it.
Then, if at some point you can do it *right* (see above), maybe you can try to re-introduce it. But the current delta code is pure and utter garbage and needs to die. No more of this "hacking shit up to make it sometimes work".
As I said. It doesn't "sometimes work" it "always works", at least for the core kernel. And I agree that it shouldn't "sometimes work" which is why this patch series sets out to remove those cases that do not work.
-- Steve
On Tue, 17 Dec 2024 at 10:32, Steven Rostedt rostedt@goodmis.org wrote:
How else do I get the function name?
My initial suggestion was to just fix up the boot time array.
I think that's actually wrong. Just print the raw data and analyze it in user space.
I can make sure that it only works for core kernel code, and print the raw address if it isn't.
Streven, STOP HACKING AROUND GARBAGE.
Your solution to "I had a bad idea that resulted in bad code" seems to always be "write more bad code".
STOP IT.
Really. This is literally what I started the whole original complaint about. Go back to my original email, and try to understand the original issue. Let me quote the really relevant part of that email again:
This stuff is full of crazy special cases for things that should never be done in the first place.
Note - and really INTERNALIZE - that "for stuff that should never be done in the first place".
You started with the wrong design. Then you keep hacking it up, and the hacks just get wilder and crazier as you notice there are more special cases.
This is now getting to the point where I'm considering just ripping out the whole boot-time previous kernel buffer crap because you seem to have turned an interesting idea into just a morass of problems.
Your choice: get rid of the crazy, or have me rip it out.
Linus
On Tue, 17 Dec 2024 10:42:47 -0800 Linus Torvalds torvalds@linux-foundation.org wrote:
Really. This is literally what I started the whole original complaint about. Go back to my original email, and try to understand the original issue. Let me quote the really relevant part of that email again:
This stuff is full of crazy special cases for things that should never be done in the first place.
To my defense, all of ftrace (function tracing, live kernel patching, etc) came from trying to do crazy special cases. The TRACE_EVENT() macro being another one ;-)
Note - and really INTERNALIZE - that "for stuff that should never be done in the first place".
You started with the wrong design. Then you keep hacking it up, and the hacks just get wilder and crazier as you notice there are more special cases.
This is now getting to the point where I'm considering just ripping out the whole boot-time previous kernel buffer crap because you seem to have turned an interesting idea into just a morass of problems.
Your choice: get rid of the crazy, or have me rip it out.
Point taken. And for my work use case, I can keep the ring buffer as is and do most the work in user space. I did these "hacks" for those that do not use trace-cmd and still use the tracefs file system directly (like a lot of the embedded folks).
But instead, I'll replace the text/data_deltas with a kaslr offset (it will only be exported if the trace contains data from a previous kernel so not to export the current kaslr offset).
Then, on our production systems, we'll save the meta data of the events we enable (this can include module events as well as dynamic events) and after a crash, we'll extract the data along with the saved data stored on disk, and be able to recreate the entire trace.
I'll only push patch 3 without the %s pointer update.
-- Steve
On Tue, 17 Dec 2024 at 11:01, Steven Rostedt rostedt@goodmis.org wrote:
But instead, I'll replace the text/data_deltas with a kaslr offset (it will only be exported if the trace contains data from a previous kernel so not to export the current kaslr offset).
Right - never export the KASRL offset for the *current* kernel, but the same interface that exports the "previous kernel trace data" can certainly export the KASLR for that previous case.
Then, on our production systems, we'll save the meta data of the events we enable (this can include module events as well as dynamic events) and after a crash, we'll extract the data along with the saved data stored on disk, and be able to recreate the entire trace.
Yes. And if you save the module names and load addresses, you can now hopefully sort out things like %s (and %pS) from modules too.
Although maybe they don't happen that often?
Linus
On Tue, 17 Dec 2024 11:38:00 -0800 Linus Torvalds torvalds@linux-foundation.org wrote:
On Tue, 17 Dec 2024 at 11:01, Steven Rostedt rostedt@goodmis.org wrote:
But instead, I'll replace the text/data_deltas with a kaslr offset (it will only be exported if the trace contains data from a previous kernel so not to export the current kaslr offset).
Right - never export the KASRL offset for the *current* kernel, but the same interface that exports the "previous kernel trace data" can certainly export the KASLR for that previous case.
But this will be future work and not something for this merge window, as it's more of a feature. The only fix is to add that print_field() code, and the patch series that removes trace_check_vprintf() (which fixes a different bug).
Then, on our production systems, we'll save the meta data of the events we enable (this can include module events as well as dynamic events) and after a crash, we'll extract the data along with the saved data stored on disk, and be able to recreate the entire trace.
Yes. And if you save the module names and load addresses, you can now hopefully sort out things like %s (and %pS) from modules too.
Although maybe they don't happen that often?
Actually, they do appear a bit. As the kmalloc trace event records the call sites that do allocation and many of them are in module code.
-- Steve
On Tue, 17 Dec 2024 at 11:43, Steven Rostedt rostedt@goodmis.org wrote:
But this will be future work and not something for this merge window, as it's more of a feature. The only fix is to add that print_field() code, and the patch series that removes trace_check_vprintf() (which fixes a different bug).
Side note: I've also been looking at the core vsnprintf() code to see if it could be cleanly extended to have callers give more control over it. Some kind of callback mechanism for the pointer handling so that there is *not* any need for crazy hackery.
That needs some basic cleanups I want to do and that I'm playing with, but even with some of that code cleaned up it looks a bit nasty.
I really don't want to expose too much of the internals to the outside, and vsnprintf() is actually fairly performance-critical, and indirect calls have become so expensive that I really don't want to introduce function pointers in there.
But I'll try to see if some more cleanups would make it at least possible to have a separate version. That said, we already have the disgusting "binary printf" code, used by bpf and tracing, and I'd hate to introduce a *third* interface, particularly since the binary printf code is already doing things wrong (it puts things into a "word array", but instead of using a single word for char/short like the normal C varargs code does, it packs them at actual byte/word boundaries and adds padding etc).
So just looking at that code, I'm not hugely excited about adding yet more copies of this and new interfaces in this area.
(Aside: are those binary buffers actually exported to user space (that "u32 *bin_buf, size_t size" thing), or could we fix the binary printf code to really use a whole word for char/short values? The difference between '%hhd' and '%d' should be how it's printed out, not how the data is accessed)
Linus
On Tue, 17 Dec 2024 14:24:08 -0800 Linus Torvalds torvalds@linux-foundation.org wrote:
(Aside: are those binary buffers actually exported to user space (that "u32 *bin_buf, size_t size" thing), or could we fix the binary printf code to really use a whole word for char/short values? The difference between '%hhd' and '%d' should be how it's printed out, not how the data is accessed)
libtraceevent is able to parse the raw trace_printk() events:
https://git.kernel.org/pub/scm/libs/libtrace/libtraceevent.git/tree/src/even...
The format it reads is from /sys/kernel/tracing/events/ftrace/bprint/format:
name: bprint ID: 6 format: field:unsigned short common_type; offset:0; size:2; signed:0; field:unsigned char common_flags; offset:2; size:1; signed:0; field:unsigned char common_preempt_count; offset:3; size:1; signed:0; field:int common_pid; offset:4; size:4; signed:1;
field:unsigned long ip; offset:8; size:8; signed:0; field:const char * fmt; offset:16; size:8; signed:0; field:u32 buf[]; offset:24; size:0; signed:0;
print fmt: "%ps: %s", (void *)REC->ip, REC->fmt
In this case, the "print fmt" is ignored.
Where the buf value holds the binary storage from vbin_printf() and written in trace_vbprintk(). Yeah, it looks like it does depend on the arguments being word aligned.
-- Steve
On Tue, 17 Dec 2024 at 14:52, Steven Rostedt rostedt@goodmis.org wrote:
Where the buf value holds the binary storage from vbin_printf() and written in trace_vbprintk(). Yeah, it looks like it does depend on the arguments being word aligned.
So the thing is, they actually aren't word-aligned at all.
Yes, the buffer is ostensibly individual words, and the buffer size is given in words, and 64-bit data is saved/fetched as two separate words.
But if you look more closely, you'll see that the way the buffer is managed is actually not as a word array at all, but using
char *str, *end;
instead of word pointers. And the reason is because it does
str = PTR_ALIGN(str, sizeof(type)); ... str += sizeof(type);
so byte-sized data is *not* given a word, it's only given a single char, and then if it's followed by an "int", the pointer will be aligned at that point.
It does mean that the binary buffers are a bit denser, but since %c and %hhd etc are VERY unusual (and often will be mixed with int-sized data), that denser format isn't commonly an actual advantage.
And the reason I noticed this? When I was trying to clean up and simplify the vsnprintf() code to not have so many different cases, the *regular* number handling for char/short/int-sized cases ends up being just one case that looks like this:
num = get_num(va_arg(args, int), fmt.state, spec);
because char/short/int are all just "va_arg(args, int)" values.
Then the actual printout depends on that printf_spec thing (and, in my experimental branch, that "fmt.state", but that's a local trial thing where I split the printf_spec differently for better code generation).
So basically the core vfsprintf case doesn't need to care about fetching char/short/int, because va_args always just formats those kinds of arguments as int, as per the usual C implicit type expansion rules.
But the binary printf thing then has to have three different cases, because unlike the normal calling convention, it actually packs char/short/int differently. So with all those nice cleanups I tried still look like this:
case FORMAT_STATE_2BYTE: num = get_num(get_arg(short), fmt.state, spec); break; case FORMAT_STATE_1BYTE: num = get_num(get_arg(char), fmt.state, spec); break; default: num = get_num(get_arg(int), fmt.state, spec); break;
which is admittedly still a lot better than the current situation in top-of-tree, which has separate versions for a *lot* more types.
So right now top-of-tree has 11 different enumerated cases for number printing:
FORMAT_TYPE_LONG_LONG, FORMAT_TYPE_ULONG, FORMAT_TYPE_LONG, FORMAT_TYPE_UBYTE, FORMAT_TYPE_BYTE, FORMAT_TYPE_USHORT, FORMAT_TYPE_SHORT, FORMAT_TYPE_UINT, FORMAT_TYPE_INT, FORMAT_TYPE_SIZE_T, FORMAT_TYPE_PTRDIFF
and in my test cleanup, I've cut this down to just four cases: FORMAT_STATE_xBYTE (x = 1, 2, 4, 8).
And the actual argument fetch for the *normal* case is actually just two cases (because the 8-byte and the "4 bytes or less" cases are different for va_list, but 1/2/4 bytes are just that single case).
But the binary printf argument save/fetch is not able to be cut down to just two cases because of how it does that odd "ostensibly a word array, but packed byte/short fields" thing.
Oh well. It's not a big deal. But I was doing this to see if I could regularize the vsnprintf() logic and make sharing better - and then just the binary version already causes unnecessary duplication.
If the *only* thing that accesses that word array is vbin_printf and bstr_printf, then I could just change the packing to act like va_list does (ie the word array would *actually* be a word array, and char and short values would get individual words).
But at least the bpf cde seems to know about the crazy packing, and also does that
tmp_buf = PTR_ALIGN(tmp_buf, sizeof(u32));
in bpf_bprintf_prepare() because it knows that it's not *actually* an array of words despite it being documented as such.
Of course, the bpf code only does the packed access thing for '%c', and doesn't seem to know that the same is true of '%hd' and '%hhd', presumably because nobody actually uses that.
Let's add Alexei to the participants. I think bpf too would actually prefer that the odd char/short packing *not* be done, if only because it clearly does the wrong thing as-is for non-%c argument (ie those %hd/%hhd cases).
Who else accesses that odd "binary printed pseudo-word array"?
Linus
On Tue, 17 Dec 2024 at 15:32, Linus Torvalds torvalds@linux-foundation.org wrote:
But if you look more closely, you'll see that the way the buffer is managed is actually not as a word array at all, but using
char *str, *end;
instead of word pointers.
Oh, and in addition to the smaller-than-int types ('%c' and '%hd' etc), pointers that get dereferenced also get written as a byte string to that word array. There might be other cases too.
So it's really a fairly odd kind of "sometimes words, sometimes not" array, with the size of the array given in words.
That binary printf is very strange.
Linus
On Tue, 17 Dec 2024 16:02:34 -0800 Linus Torvalds torvalds@linux-foundation.org wrote:
On Tue, 17 Dec 2024 at 15:32, Linus Torvalds torvalds@linux-foundation.org wrote:
But if you look more closely, you'll see that the way the buffer is managed is actually not as a word array at all, but using
char *str, *end;
instead of word pointers.
Oh, and in addition to the smaller-than-int types ('%c' and '%hd' etc), pointers that get dereferenced also get written as a byte string to that word array. There might be other cases too.
So it's really a fairly odd kind of "sometimes words, sometimes not" array, with the size of the array given in words.
That binary printf is very strange.
Note that at least on the tracing user space side, as trace_printk() is never used in production systems and mostly just used for debugging, we can be pretty liberal if I have to change libtraceevent.
I could even add an update to the format file to have the library stay backward compatible with older kernels and can see that the format file has been updated to know that the vbin_printf() has changed.
-- Steve
On Tue, Dec 17, 2024 at 3:32 PM Linus Torvalds torvalds@linux-foundation.org wrote:
If the *only* thing that accesses that word array is vbin_printf and bstr_printf, then I could just change the packing to act like va_list does (ie the word array would *actually* be a word array, and char and short values would get individual words).
But at least the bpf cde seems to know about the crazy packing, and also does that
tmp_buf = PTR_ALIGN(tmp_buf, sizeof(u32));
in bpf_bprintf_prepare() because it knows that it's not *actually* an array of words despite it being documented as such.
Of course, the bpf code only does the packed access thing for '%c', and doesn't seem to know that the same is true of '%hd' and '%hhd', presumably because nobody actually uses that.
Let's add Alexei to the participants. I think bpf too would actually prefer that the odd char/short packing *not* be done, if only because it clearly does the wrong thing as-is for non-%c argument (ie those %hd/%hhd cases).
We reject %hd case as EINVAL and do byte copy for %c. All that was done as part of commit 48cac3f4a96d ("bpf: Implement formatted output helpers with bstr_printf") that cleaned things up greatly. The byte copy for %c wasn't deliberate to save space. Just happen to work with bstr_printf(). We can totally switch to u32 if that's the direction for bstr_printf. To handle %s we use bpf_trace_copy_string(tmp_buf, ) which does _nofault() underneath. Since the tmp_buf is byte packed because of strings the %c case just adds a byte too. Strings and %c can be made u32 aligned.
Since we're on this topic, Daniel is looking to reuse format_decode() in bpf_bprintf_prepare() to get rid of our manual format validation.
On Tue, 17 Dec 2024 at 16:47, Alexei Starovoitov alexei.starovoitov@gmail.com wrote:
Since we're on this topic, Daniel is looking to reuse format_decode() in bpf_bprintf_prepare() to get rid of our manual format validation.
That was literally why I started looking into this - the many separate type formats actually end up causing format_decode() (and the callers) to have to generate multiple different cases, which then in turn either cause a jump table, or - more commonly due to the CPU indirect branch mitigations - a chain of conditionals that are fairly ugly.
Compressing the state table for the types from 11 down to 4 types helps a bit, but then also dealing with the "smaller than int" things as just 'int' (with the formatting flags that are separate) also ends up avoiding some unnecessary and extra cases.
Because in the end, 'size_t' and 'long' are the same thing, even on architectures like 32-bit x86 where 'size_t' really is 'unsigned int' - simply because the only thing that matters for fetching the value is the size, which is 32-bit.
(The whole "is it signed" and the truncation to smaller-than-int etc is then something we have to handle anyway in by the 'printf_spec' thing).
So I have a patch series to clean some of this up and avoid the extra states. I'm not entirely happy with it, though, and I've been going back and forth on some of the code, so I'm not ready to post it or have anybody use it as a basis for some "real" cleanups.
I guess I could at least post the "turn 11 different types into 4" part. I have other things in there, but that part seems fairly unambiguously good.
Let me go separate that part out and maybe people can point out where I've done something silly.
Linus
On Tue, 17 Dec 2024 at 17:26, Linus Torvalds torvalds@linux-foundation.org wrote:
Let me go separate that part out and maybe people can point out where I've done something silly.
Ok, that part I had actually already locally separated out better than some of my later patches in the series, so I sent it out as
https://lore.kernel.org/all/20241218013620.1679088-1-torvalds@linux-foundati...
but I'm not guaranteeing it's right. Consider it a WIP thing, and only a first step.
Linus
On Tue, Dec 17, 2024 at 5:39 PM Linus Torvalds torvalds@linux-foundation.org wrote:
On Tue, 17 Dec 2024 at 17:26, Linus Torvalds torvalds@linux-foundation.org wrote:
Let me go separate that part out and maybe people can point out where I've done something silly.
Ok, that part I had actually already locally separated out better than some of my later patches in the series, so I sent it out as
https://lore.kernel.org/all/20241218013620.1679088-1-torvalds@linux-foundati...
lgtm. Since bstr_printf() converts 1/2/4 to unsigned long long num with a sign according to the format specifier it's good from the calling convention perspective. Doesn't matter here, but anyone passing 32-bit ints around needs to be aware of odd riscv abi promotion rules. x86-64 and arm64 zero extend 32-bit ints while riscv does sign extension when s32 and u32 are passed. Quote from the spec: "In RV64, 32-bit types, such as int, are stored in integer registers as proper sign extensions of their 32-bit values; that is, bits 63..31 are all equal. This restriction holds even for unsigned 32-bit types."
On Tue, 17 Dec 2024 at 10:42, Linus Torvalds torvalds@linux-foundation.org wrote:
My initial suggestion was to just fix up the boot time array.
I think that's actually wrong. Just print the raw data and analyze it in user space.
.. I still think it's not the optimal solution, but fixing up the event data from the previous boot (*before* printing it, and entirely independently of vsnprintf()) would at least avoid the whole "mess with vsnprintf and switch the format string around as you are trying to walk the va_list in sync".
Because that was really a non-starter. Both the format string hackery and the va_list hackery was just fundamentally bogus.
If you massage the data before printing - and independently of it - those two issues should at least go away.
Linus
On Tue, 17 Dec 2024 11:03:28 -0800 Linus Torvalds torvalds@linux-foundation.org wrote:
On Tue, 17 Dec 2024 at 10:42, Linus Torvalds torvalds@linux-foundation.org wrote:
My initial suggestion was to just fix up the boot time array.
I think that's actually wrong. Just print the raw data and analyze it in user space.
.. I still think it's not the optimal solution, but fixing up the event data from the previous boot (*before* printing it, and entirely independently of vsnprintf()) would at least avoid the whole "mess with vsnprintf and switch the format string around as you are trying to walk the va_list in sync".
Because that was really a non-starter. Both the format string hackery and the va_list hackery was just fundamentally bogus.
If you massage the data before printing - and independently of it - those two issues should at least go away.
But I can't massage the data without the deltas. That takes us back to having to have the same kernel and only processing kernel core data and ignoring modules.
-- Steve
On Tue, 17 Dec 2024 14:07:50 -0500 Steven Rostedt rostedt@goodmis.org wrote:
On Tue, 17 Dec 2024 11:03:28 -0800 Linus Torvalds torvalds@linux-foundation.org wrote:
On Tue, 17 Dec 2024 at 10:42, Linus Torvalds torvalds@linux-foundation.org wrote:
My initial suggestion was to just fix up the boot time array.
I think that's actually wrong. Just print the raw data and analyze it in user space.
.. I still think it's not the optimal solution, but fixing up the event data from the previous boot (*before* printing it, and entirely independently of vsnprintf()) would at least avoid the whole "mess with vsnprintf and switch the format string around as you are trying to walk the va_list in sync".
And that code that does the va_list and vsnprintf() tricks is going to be removed as soon as that patch set finishes going through my full test suite, and you are OK with the solution.
Again, that patch set is here:
https://lore.kernel.org/linux-trace-kernel/20241217024118.587584221@goodmis....
-- Steve
On Tue, 17 Dec 2024 09:46:30 -0800 Linus Torvalds torvalds@linux-foundation.org wrote:
Now you basically hide that, and make it harder to see any data from a bad kernel (since you presumably need to boot into a good kernel to do analysis).
Note, this isn't for debugging kernels that crash at boot. There's other methods to debug things like that. Like, ftrace_dump_on_oops. This is for debugging something that happens after boot or in the field. Where the "bad" kernel can still boot in a good state. The kernel should be good enough to boot into a state where you can extract the trace and send it offline somewhere.
I used this to debugging a few things in development, where someone hands me a reproducer. I enabled the boot_mapped buffer, run the reproducer, the system crashes and reboots into the same kernel. Then I was able to debug what happened, as long as I don't re-run the reproducer. But I did do several iterations of:
Start boot mapped tracing, run reproducer, crash, reboot, look at boot mapped trace. Restart boot mapped tracing with more events, run reproducer, crash, reboot, look at mapped trace, wash, rinse, repeat.
-- Steve
linux-stable-mirror@lists.linaro.org