On 11/09/2024 06.43, Daniel Xu wrote:
[cc Jesper]
On Tue, Sep 10, 2024, at 8:31 PM, Daniel Xu wrote:
On Tue, Sep 10, 2024 at 05:39:55PM GMT, Andrii Nakryiko wrote:
On Tue, Sep 10, 2024 at 4:44 PM Daniel Xu dxu@dxuuu.xyz wrote:
On Tue, Sep 10, 2024 at 03:21:04PM GMT, Andrii Nakryiko wrote:
On Tue, Sep 10, 2024 at 3:16 PM Daniel Xu dxu@dxuuu.xyz wrote:
[...cut...]
Can you give us a bit more details on what you are trying to achieve?
BPF cpumap, under the hood, has one MPSC ring buffer (ptr_ring) for each entry in the cpumap. When a prog redirects to an entry in the cpumap, the machinery queues up the xdp frame onto the destination CPU ptr_ring. This can occur on any cpu, thus multi-producer. On processing side, there is only the kthread created by the cpumap entry and bound to the specific cpu that is consuming entries. So single consumer.
An important detail: to get Multi-Producer (MP) to scale the CPUMAP does bulk enqueue into the ptr_ring. It stores the xdp_frame's in a per-CPU array and does the flush/enqueue as part of the xdp_do_flush(). Because I was afraid of this adding latency, I choose to also flush every 8 frames (CPU_MAP_BULK_SIZE).
Looking at code I see this is also explained in a comment:
/* General idea: XDP packets getting XDP redirected to another CPU, * will maximum be stored/queued for one driver ->poll() call. It is * guaranteed that queueing the frame and the flush operation happen on * same CPU. Thus, cpu_map_flush operation can deduct via this_cpu_ptr() * which queue in bpf_cpu_map_entry contains packets. */
Goal is to track the latency overhead added from ptr_ring and the kthread (versus softirq where is less overhead). Ideally we want p50, p90, p95, p99 percentiles.
I'm very interesting in this use-case of understanding the latency of CPUMAP. I'm a fan of latency histograms that I turn into heatmaps in grafana.
To do this, we need to track every single entry enqueue time as well as dequeue time - events that occur in the tail are quite important.
Since ptr_ring is also a ring buffer, I thought it would be easy, reliable, and fast to just create a "shadow" ring buffer. Every time producer enqueues entries, I'd enqueue the same number of current timestamp onto shadow RB. Same thing on consumer side, except dequeue and calculate timestamp delta.
This idea seems overkill and will likely produce unreliable results. E.g. the overhead of this additional ring buffer will also affect the measurements.
I was originally planning on writing my own lockless ring buffer in pure BPF (b/c spinlocks cannot be used w/ tracepoints yet) but was hoping I could avoid that with this patch.
[...]
Alternatively, could add a u64 timestamp to xdp_frame, which makes all this tracking inline (and thus more reliable). But I'm not sure how precious the space in that struct is - I see some references online saying most drivers save 128B headroom. I also see:
#define XDP_PACKET_HEADROOM 256
I like the inline idea. I would suggest to add u64 timestamp into XDP-metadata area (ctx->data_meta code example[1]) , when XDP runs in RX-NAPI. Then at the remote CPU you can run another CPUMAP-XDP program that pickup this timestamp, and then calc a delta from "now" timestamp.
[1] https://github.com/xdp-project/bpf-examples/blob/master/AF_XDP-interaction/a...
Could probably amortize the timestamp read by setting it in bq_flush_to_queue().
To amortize, consider that you might not need to timestamp EVERY packet to get sufficient statistics on the latency.
Regarding bq_flush_to_queue() and the enqueue tracepoint: trace_xdp_cpumap_enqueue(rcpu->map_id, processed, drops, to_cpu)
I have an idea for you, on how to measure the latency overhead from XDP RX-processing to when enqueue "flush" happens. It is a little tricky to explain, so I will outline the steps.
1. XDP bpf_prog store timestamp in per-CPU array, unless timestamp is already set.
2. trace_xdp_cpumap_enqueue bpf_prog reads per-CPU timestamp and calc latency diff, and clears timestamp.
This measures the latency overhead of bulk enqueue. (Notice: Only the first XDP redirect frame after a bq_flush_to_queue() will set the timestamp). This per-CPU store should work as this all runs under same RX-NAPI "poll" execution.
This latency overhead of bulk enqueue, will (unfortunately) also count/measure the XDP_PASS packets that gets processed by the normal netstack. So, watch out for this. e.g could have XDP actions (e.g XDP_PASS) counters as part of step 1, and have statistic for cases where XDP_PASS interfered.
--Jesper