On Thu, Sep 12, 2024, at 2:40 AM, Jesper Dangaard Brouer wrote:
On 11/09/2024 20.53, Daniel Xu wrote:
On Wed, Sep 11, 2024 at 10:32:56AM GMT, Jesper Dangaard Brouer wrote:
On 11/09/2024 06.43, Daniel Xu wrote:
[cc Jesper]
On Tue, Sep 10, 2024, at 8:31 PM, Daniel Xu wrote:
On Tue, Sep 10, 2024 at 05:39:55PM GMT, Andrii Nakryiko wrote:
On Tue, Sep 10, 2024 at 4:44 PM Daniel Xu dxu@dxuuu.xyz wrote: > > On Tue, Sep 10, 2024 at 03:21:04PM GMT, Andrii Nakryiko wrote: >> On Tue, Sep 10, 2024 at 3:16 PM Daniel Xu dxu@dxuuu.xyz wrote: >>>
[...cut...]
Can you give us a bit more details on what you are trying to achieve?
BPF cpumap, under the hood, has one MPSC ring buffer (ptr_ring) for each entry in the cpumap. When a prog redirects to an entry in the cpumap, the machinery queues up the xdp frame onto the destination CPU ptr_ring. This can occur on any cpu, thus multi-producer. On processing side, there is only the kthread created by the cpumap entry and bound to the specific cpu that is consuming entries. So single consumer.
An important detail: to get Multi-Producer (MP) to scale the CPUMAP does bulk enqueue into the ptr_ring. It stores the xdp_frame's in a per-CPU array and does the flush/enqueue as part of the xdp_do_flush(). Because I was afraid of this adding latency, I choose to also flush every 8 frames (CPU_MAP_BULK_SIZE).
Looking at code I see this is also explained in a comment:
/* General idea: XDP packets getting XDP redirected to another CPU,
- will maximum be stored/queued for one driver ->poll() call. It is
- guaranteed that queueing the frame and the flush operation happen on
- same CPU. Thus, cpu_map_flush operation can deduct via this_cpu_ptr()
- which queue in bpf_cpu_map_entry contains packets.
*/
Goal is to track the latency overhead added from ptr_ring and the kthread (versus softirq where is less overhead). Ideally we want p50, p90, p95, p99 percentiles.
I'm very interesting in this use-case of understanding the latency of CPUMAP. I'm a fan of latency histograms that I turn into heatmaps in grafana.
To do this, we need to track every single entry enqueue time as well as dequeue time - events that occur in the tail are quite important.
Since ptr_ring is also a ring buffer, I thought it would be easy, reliable, and fast to just create a "shadow" ring buffer. Every time producer enqueues entries, I'd enqueue the same number of current timestamp onto shadow RB. Same thing on consumer side, except dequeue and calculate timestamp delta.
This idea seems overkill and will likely produce unreliable results. E.g. the overhead of this additional ring buffer will also affect the measurements.
Yeah, good point.
I was originally planning on writing my own lockless ring buffer in pure BPF (b/c spinlocks cannot be used w/ tracepoints yet) but was hoping I could avoid that with this patch.
[...]
Alternatively, could add a u64 timestamp to xdp_frame, which makes all this tracking inline (and thus more reliable). But I'm not sure how precious the space in that struct is - I see some references online saying most drivers save 128B headroom. I also see:
#define XDP_PACKET_HEADROOM 256
I like the inline idea. I would suggest to add u64 timestamp into XDP-metadata area (ctx->data_meta code example[1]) , when XDP runs in RX-NAPI. Then at the remote CPU you can run another CPUMAP-XDP program that pickup this timestamp, and then calc a delta from "now" timestamp.
[1] https://github.com/xdp-project/bpf-examples/blob/master/AF_XDP-interaction/a...
Cool! This is a much better idea than mine :)
I'll give this a try.
Could probably amortize the timestamp read by setting it in bq_flush_to_queue().
To amortize, consider that you might not need to timestamp EVERY packet to get sufficient statistics on the latency.
Regarding bq_flush_to_queue() and the enqueue tracepoint: trace_xdp_cpumap_enqueue(rcpu->map_id, processed, drops, to_cpu)
I have an idea for you, on how to measure the latency overhead from XDP RX-processing to when enqueue "flush" happens. It is a little tricky to explain, so I will outline the steps.
XDP bpf_prog store timestamp in per-CPU array, unless timestamp is already set.
trace_xdp_cpumap_enqueue bpf_prog reads per-CPU timestamp and calc latency diff, and clears timestamp.
This measures the latency overhead of bulk enqueue. (Notice: Only the first XDP redirect frame after a bq_flush_to_queue() will set the timestamp). This per-CPU store should work as this all runs under same RX-NAPI "poll" execution.
Makes sense to me. This breaks down the latency even further. I'll keep it in mind if we need further troubleshooting.
Yes, this breaks down the latency even further :-)
This latency overhead of bulk enqueue, will (unfortunately) also count/measure the XDP_PASS packets that gets processed by the normal netstack. So, watch out for this. e.g could have XDP actions (e.g XDP_PASS) counters as part of step 1, and have statistic for cases where XDP_PASS interfered.
Not sure I got this. If we only set the percpu timestamp for XDP_REDIRECT frames, then I don't see how XDP_PASS interferes. Maybe I misunderstand something.
Not quite. You only set timestamp for the first XDP_REDIRECT frames. I'm simply saying, all the packets processed *after* this timestamp will attribute to the time it takes until trace_xdp_cpumap_enqueue() runs. That part should be obvious. Then, I'm saying remember that an XDP_PASS packet takes more time than a XDP_REDIRECT packet. I hope this makes it more clear. Point: This is a pitfall you need to be aware of when looking at your metrics.
For the inline timestamping same pitfall actually applies. There you timestamp the packets themselves. Because the CPUMAP enqueue happens as the *LAST* thing in NAPI loop, during xdp_do_flush() call. This means that interleaved normal netstack (XDP_PASS) packets will be processed BEFORE this call to xdp_do_flush(). As noted earlier, to compensate for this effect, code will enqueue-flush every 8 frames (CPU_MAP_BULK_SIZE).
Ah, that makes sense. Thanks for explaining.
I hope, I've not confused you more :-|
No, that was helpful!
Thanks, Daniel