mas_skip_node() is used to move the maple state to the node with a higher limit. It does this by walking up the tree and increasing the slot count. Since slot count may not be able to be increased, it may need to walk up multiple times to find room to walk right to a higher limit node. The limit of slots that was being used was the node limit and not the last location of data in the node. This would cause the maple state to be shifted outside actual data and enter an error state, thus returning -EBUSY.
The result of the incorrect error state means that mas_awalk() would return an error instead of finding the allocation space.
The fix is to use mas_data_end() in mas_skip_node() to detect the nodes data end point and continue walking the tree up until it is safe to move to a node with a higher limit.
mas_skip_node() may also be passed a maple state in an error state from mas_anode_descend() when no allocations are available. Return on such an error state immediately.
Reported-by: Snild Dolkow snild@sony.com Link: https://lore.kernel.org/linux-mm/cb8dc31a-fef2-1d09-f133-e9f7b9f9e77a@sony.c... Cc: Stable@vger.kernel.org Fixes: 54a611b60590 ("Maple Tree: add new data structure") Signed-off-by: Liam R. Howlett Liam.Howlett@oracle.com --- lib/maple_tree.c | 25 ++++++++++--------------- 1 file changed, 10 insertions(+), 15 deletions(-)
diff --git a/lib/maple_tree.c b/lib/maple_tree.c index 2be86368237d..2efe854946d6 100644 --- a/lib/maple_tree.c +++ b/lib/maple_tree.c @@ -5188,34 +5188,29 @@ static inline bool mas_rewind_node(struct ma_state *mas) */ static inline bool mas_skip_node(struct ma_state *mas) { - unsigned char slot, slot_count; unsigned long *pivots; enum maple_type mt;
- mt = mte_node_type(mas->node); - slot_count = mt_slots[mt] - 1; + if (mas_is_err(mas)) + return false; + do { if (mte_is_root(mas->node)) { - slot = mas->offset; - if (slot > slot_count) { + if (mas->offset >= mas_data_end(mas)) { mas_set_err(mas, -EBUSY); return false; } } else { mas_ascend(mas); - slot = mas->offset; - mt = mte_node_type(mas->node); - slot_count = mt_slots[mt] - 1; } - } while (slot > slot_count); + } while (mas->offset >= mas_data_end(mas));
- mas->offset = ++slot; + mt = mte_node_type(mas->node); pivots = ma_pivots(mas_mn(mas), mt); - if (slot > 0) - mas->min = pivots[slot - 1] + 1; - - if (slot <= slot_count) - mas->max = pivots[slot]; + mas->min = pivots[mas->offset] + 1; + mas->offset++; + if (mas->offset < mt_slots[mt]) + mas->max = pivots[mas->offset];
return true; }
Test robust filling of an entire area of the tree, then test one beyond. This is to test the walking back up the tree at the end of nodes and error condition.
Test inspired by the reproducer code provided by Snild Dolkow.
Cc: Snild Dolkow snild@sony.com Link: https://lore.kernel.org/linux-mm/cb8dc31a-fef2-1d09-f133-e9f7b9f9e77a@sony.c... Cc: Stable@vger.kernel.org Fixes: e15e06a83923 ("lib/test_maple_tree: add testing for maple tree") Signed-off-by: Liam R. Howlett Liam.Howlett@oracle.com --- lib/test_maple_tree.c | 35 +++++++++++++++++++++++++++++++++++ 1 file changed, 35 insertions(+)
diff --git a/lib/test_maple_tree.c b/lib/test_maple_tree.c index 3d19b1f78d71..fa86874763f0 100644 --- a/lib/test_maple_tree.c +++ b/lib/test_maple_tree.c @@ -2670,6 +2670,36 @@ static noinline void check_empty_area_window(struct maple_tree *mt) rcu_read_unlock(); }
+static noinline void check_empty_area_fill(struct maple_tree *mt) +{ + int loop, shift; + unsigned long max = 0x25D78000; + unsigned long size; + MA_STATE(mas, mt, 0, 0); + + mt_set_non_kernel(99999); + for (shift = 12; shift <= 16; shift++) { + loop = 5000; + size = 1 << shift; + while (loop--) { + mas_lock(&mas); + MT_BUG_ON(mt, mas_empty_area(&mas, 0, max, size) != 0); + MT_BUG_ON(mt, mas.last != mas.index + size - 1); + mas_store_gfp(&mas, &check_empty_area_fill, GFP_KERNEL); + mas_unlock(&mas); + mas_reset(&mas); + } + } + + /* No space left. */ + size = 0x1000; + rcu_read_lock(); + MT_BUG_ON(mt, mas_empty_area(&mas, 0, max, size) != -EBUSY); + rcu_read_unlock(); + + mt_set_non_kernel(0); +} + static DEFINE_MTREE(tree); static int maple_tree_seed(void) { @@ -2926,6 +2956,11 @@ static int maple_tree_seed(void) check_empty_area_window(&tree); mtree_destroy(&tree);
+ mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE); + check_empty_area_fill(&tree); + mtree_destroy(&tree); + + #if defined(BENCH) skip: #endif
Thanks, this is a significant improvement. Applying it on top of v6.1.12 allows my reproducer to pass most of the time (running as init in qemu).
Unfortunately, it's still failing around 10% of the time:
$ for x in $(seq 100); do qemu-system-x86_64 -nographic -no-reboot -append 'console=ttyS0 panic=-1' -kernel arch/x86/boot/bzImage -initrd initrd/initrd.gz; done | tee qemu.log [...] > $ egrep -o 'Failed|Success' qemu.log | sort | uniq -c 11 Failed 89 Success
The failures now happen later, around 25 MiB:
$ grep Failed qemu.log Failed. m=0xffffffffffffffff size=8192 (1<<13) i=1050 errno=12 total_leaks=29081600 (27 MiB) Failed. m=0xffffffffffffffff size=8192 (1<<13) i=332 errno=12 total_leaks=23199744 (22 MiB) Failed. m=0xffffffffffffffff size=8192 (1<<13) i=838 errno=12 total_leaks=27344896 (26 MiB) Failed. m=0xffffffffffffffff size=8192 (1<<13) i=282 errno=12 total_leaks=22790144 (21 MiB) Failed. m=0xffffffffffffffff size=8192 (1<<13) i=695 errno=12 total_leaks=26173440 (24 MiB) Failed. m=0xffffffffffffffff size=8192 (1<<13) i=1064 errno=12 total_leaks=29196288 (27 MiB) Failed. m=0xffffffffffffffff size=8192 (1<<13) i=608 errno=12 total_leaks=25460736 (24 MiB) Failed. m=0xffffffffffffffff size=8192 (1<<13) i=443 errno=12 total_leaks=24109056 (22 MiB) Failed. m=0xffffffffffffffff size=8192 (1<<13) i=549 errno=12 total_leaks=24977408 (23 MiB) Failed. m=0xffffffffffffffff size=8192 (1<<13) i=630 errno=12 total_leaks=25640960 (24 MiB) Failed. m=0xffffffffffffffff size=8192 (1<<13) i=820 errno=12 total_leaks=27197440 (25 MiB)
Just to make sure, I went back to e15e06a8 and ran the same loop.
$ egrep -o 'Failed|Success' qemu.log | sort | uniq -c 100 Success
And with the patches applied on top of master (ee3f96b1):
$ egrep -o 'Failed|Success' qemu.log | sort | uniq -c 10 Failed 90 Success
//Snild
Hi, Liam,
mas_skip_node() is used to move the maple state to the node with a higher limit. It does this by walking up the tree and increasing the slot count. Since slot count may not be able to be increased, it may need to walk up multiple times to find room to walk right to a higher limit node. The limit of slots that was being used was the node limit and not the last location of data in the node. This would cause the maple state to be shifted outside actual data and enter an error state, thus returning -EBUSY.
The result of the incorrect error state means that mas_awalk() would return an error instead of finding the allocation space.
The fix is to use mas_data_end() in mas_skip_node() to detect the nodes data end point and continue walking the tree up until it is safe to move to a node with a higher limit.
mas_skip_node() may also be passed a maple state in an error state from mas_anode_descend() when no allocations are available. Return on such an error state immediately.
Reported-by: Snild Dolkow snild@sony.com Link: https://lore.kernel.org/linux-mm/cb8dc31a-fef2-1d09-f133-e9f7b9f9e77a@sony.c... Cc: Stable@vger.kernel.org Fixes: 54a611b60590 ("Maple Tree: add new data structure") Signed-off-by: Liam R. Howlett Liam.Howlett@oracle.com
Reviewed-by: Peng Zhang zhangpeng.00@bytedance.com
lib/maple_tree.c | 25 ++++++++++--------------- 1 file changed, 10 insertions(+), 15 deletions(-)
diff --git a/lib/maple_tree.c b/lib/maple_tree.c index 2be86368237d..2efe854946d6 100644 --- a/lib/maple_tree.c +++ b/lib/maple_tree.c @@ -5188,34 +5188,29 @@ static inline bool mas_rewind_node(struct ma_state *mas) */ static inline bool mas_skip_node(struct ma_state *mas) {
- unsigned char slot, slot_count; unsigned long *pivots; enum maple_type mt;
- mt = mte_node_type(mas->node);
- slot_count = mt_slots[mt] - 1;
- if (mas_is_err(mas))
return false;
- do { if (mte_is_root(mas->node)) {
slot = mas->offset;
if (slot > slot_count) {
} else { mas_ascend(mas);if (mas->offset >= mas_data_end(mas)) { mas_set_err(mas, -EBUSY); return false; }
slot = mas->offset;
mt = mte_node_type(mas->node);
}slot_count = mt_slots[mt] - 1;
- } while (slot > slot_count);
- } while (mas->offset >= mas_data_end(mas));
- mas->offset = ++slot;
- mt = mte_node_type(mas->node); pivots = ma_pivots(mas_mn(mas), mt);
- if (slot > 0)
mas->min = pivots[slot - 1] + 1;
- if (slot <= slot_count)
mas->max = pivots[slot];
- mas->min = pivots[mas->offset] + 1;
- mas->offset++;
- if (mas->offset < mt_slots[mt])
mas->max = pivots[mas->offset];
There is a bug here, the assignment of mas->min and mas->max is wrong. The assignment will make them represent the range of a child node, but it should represent the range of the current node. After mas_ascend() returns, mas-min and mas->max already represent the range of the current node, so we should delete these assignments of mas->min and mas->max.
return true; }
Sincerely yours, Peng.
On 2023-03-07 14:05, Peng Zhang wrote:
Hi, Liam,
- } while (slot > slot_count); + } while (mas->offset >= mas_data_end(mas)); - mas->offset = ++slot; + mt = mte_node_type(mas->node); pivots = ma_pivots(mas_mn(mas), mt); - if (slot > 0) - mas->min = pivots[slot - 1] + 1;
- if (slot <= slot_count) - mas->max = pivots[slot]; + mas->min = pivots[mas->offset] + 1; + mas->offset++; + if (mas->offset < mt_slots[mt]) + mas->max = pivots[mas->offset];
There is a bug here, the assignment of mas->min and mas->max is wrong. The assignment will make them represent the range of a child node, but it should represent the range of the current node. After mas_ascend() returns, mas-min and mas->max already represent the range of the current node, so we should delete these assignments of mas->min and mas->max.
Thanks for your suggestion, Peng. Applying it literally by removing only the min/max assignments:
diff --git a/lib/maple_tree.c b/lib/maple_tree.c index 6fc1ad42b409..9b6e581cf83f 100644 --- a/lib/maple_tree.c +++ b/lib/maple_tree.c @@ -5118,10 +5118,7 @@ static inline bool mas_skip_node
mt = mte_node_type(mas->node); pivots = ma_pivots(mas_mn(mas), mt); - mas->min = pivots[mas->offset] + 1; mas->offset++; - if (mas->offset < mt_slots[mt]) - mas->max = pivots[mas->offset];
return true; }
This allowed my test to pass 100/100 runs. Still in qemu with the test as init, so not really stressed in any way except that specific usecase.
//Snild
* Snild Dolkow snild@sony.com [230307 09:31]:
On 2023-03-07 14:05, Peng Zhang wrote:
Hi, Liam,
- } while (slot > slot_count); + } while (mas->offset >= mas_data_end(mas)); - mas->offset = ++slot; + mt = mte_node_type(mas->node); pivots = ma_pivots(mas_mn(mas), mt); - if (slot > 0) - mas->min = pivots[slot - 1] + 1;
- if (slot <= slot_count) - mas->max = pivots[slot]; + mas->min = pivots[mas->offset] + 1; + mas->offset++; + if (mas->offset < mt_slots[mt]) + mas->max = pivots[mas->offset];
There is a bug here, the assignment of mas->min and mas->max is wrong. The assignment will make them represent the range of a child node, but it should represent the range of the current node. After mas_ascend() returns, mas-min and mas->max already represent the range of the current node, so we should delete these assignments of mas->min and mas->max.
Thanks for your suggestion, Peng. Applying it literally by removing only the min/max assignments:
diff --git a/lib/maple_tree.c b/lib/maple_tree.c index 6fc1ad42b409..9b6e581cf83f 100644 --- a/lib/maple_tree.c +++ b/lib/maple_tree.c @@ -5118,10 +5118,7 @@ static inline bool mas_skip_node
mt = mte_node_type(mas->node); pivots = ma_pivots(mas_mn(mas), mt);
mas->min = pivots[mas->offset] + 1; mas->offset++;
if (mas->offset < mt_slots[mt])
mas->max = pivots[mas->offset]; return true;
}
This allowed my test to pass 100/100 runs. Still in qemu with the test as init, so not really stressed in any way except that specific usecase.
Yes, I have a v2 that removes these lines and some extra unused lines. I've been working on a recreation to add to the testing before I send v2 out. I had 100% success running my testing over the weekend, but I wanted to have a testcase before sending the change.
Thanks, Liam
Hi, Snild,
在 2023/3/7 22:30, Snild Dolkow 写道:
On 2023-03-07 14:05, Peng Zhang wrote:
Hi, Liam,
- } while (slot > slot_count); + } while (mas->offset >= mas_data_end(mas)); - mas->offset = ++slot; + mt = mte_node_type(mas->node); pivots = ma_pivots(mas_mn(mas), mt); - if (slot > 0) - mas->min = pivots[slot - 1] + 1;
- if (slot <= slot_count) - mas->max = pivots[slot]; + mas->min = pivots[mas->offset] + 1; + mas->offset++; + if (mas->offset < mt_slots[mt]) + mas->max = pivots[mas->offset];
There is a bug here, the assignment of mas->min and mas->max is wrong. The assignment will make them represent the range of a child node, but it should represent the range of the current node. After mas_ascend() returns, mas-min and mas->max already represent the range of the current node, so we should delete these assignments of mas->min and mas->max.
Thanks for your suggestion, Peng. Applying it literally by removing only the min/max assignments:
diff --git a/lib/maple_tree.c b/lib/maple_tree.c index 6fc1ad42b409..9b6e581cf83f 100644 --- a/lib/maple_tree.c +++ b/lib/maple_tree.c @@ -5118,10 +5118,7 @@ static inline bool mas_skip_node
mt = mte_node_type(mas->node); pivots = ma_pivots(mas_mn(mas), mt); - mas->min = pivots[mas->offset] + 1; mas->offset++; - if (mas->offset < mt_slots[mt]) - mas->max = pivots[mas->offset];
return true; }
This allowed my test to pass 100/100 runs. Still in qemu with the test as init, so not really stressed in any way except that specific usecase.
//Snild
Thanks for the test, I'm happy if it happens to fix your problem. So a patch was made. This patch needs to be applied after Liam's patch.
Sincerely yours, Peng.
Sorry I forgot to post the link, here it is: https://lore.kernel.org/lkml/20230307160340.57074-1-zhangpeng.00@bytedance.c...
* Peng Zhang zhangpeng.00@bytedance.com [230307 11:21]:
Sorry I forgot to post the link, here it is: https://lore.kernel.org/lkml/20230307160340.57074-1-zhangpeng.00@bytedance.c...
I will roll this into V2 of my patch set. It is exactly what I have, but I need to come up with a testcase for this first.
linux-stable-mirror@lists.linaro.org