On Thu, Feb 6, 2025 at 2:20 AM Yan Zhai yan@cloudflare.com wrote:
On Thu, Feb 6, 2025 at 12:40 AM Hou Tao houtao@huaweicloud.com wrote:
Yes, the retry logic doesn't change the previous key. Thanks for the clarifying.
And by "skipping to the next key", it's simply
if (err == -ENOENT) goto next_key;
Note the "next_key" label was not in the current codebase. It is only in my posted patch. I don't think this would break lpm_trie unless I missed something.
I was trying to say that the proposed patch may break the lookup_batch operation for lpm_trie, and let me explain step by step:
For LPM trie map: (1) ->map_get_next_key(map, prev_key, key) returns a valid key
(2) bpf_map_copy_value() return -ENOMENT It means the key must be deleted concurrently.
I see what you mean now, thanks for the detailed explanation!
So for lpm_trie, if an element is deleted between get_next_key and copy_value, then retry would still proceed to its next tree node. But with or without retry, I think we cannot prevent the key from cycling back to the beginning: an element can be deleted after copy_value, so the key becomes invalid. After swap with prev_key and call bpf_get_next_key, it cycles back to the leftmost. Similar can happen if during retry the prev_key also gets invalidated. IIUC the rcu locking in lookup really cannot prevent the tree structure from changes.
On the other hand, bpf_map_get_next_key manual already specifies "If a key is not found, the operation returns zero and sets the next_key pointer to the key of the first element.". IMHO it probably makes more sense that, regardless of normal or batch lookup, it is ultimately the user's responsibility to properly synchronize, or deduplicate what is returned from the kernel. I intend to keep the "skip to next" to save the unnecessary complexity here, unless there are strong objections that I should not.
Thought about this again. It is interesting that hashmap batch lookup actually guarantees no duplicate in output today. If we want to achieve the same for lpm_trie, it should have its own batch lookup routine, which could properly lock the trie against mutation. For the generic batch, it is still better to be simple. Let me put this in the V2 patchset for discussion.
best Yan
(3) goto next_key It swaps the prev_key and key
(4) ->map_get_next_key(map, prev_key, key) again prev_key points to a non-existing key, for LPM trie it will treat just like prev_key=NULL case, the returned key will be duplicated.