Hi Arnd,
I took a look at the stack usage issue in the kernel snippet you provided [1], and as you noted, most of the impact does come from the -ftree-ch optimization. It is enabled at all optimization levels except -Os (since, besides possibly increasing stack usage, it might also increase code size).
I am still trying to fully grasp what the -ftree-ch optimization does, but my understanding so far is that it tries to reorganize the loop for later loop optimization phases. More specifically, what it ends up doing on this snippet is creating extra stack variables for the internal member accesses in the inner loop (which in turn increases stack usage).
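For reference, here is a minimal source-level sketch of what loop header copying amounts to. The real pass works on GIMPLE rather than on C source, and the function names below are hypothetical, so take this as an illustration of the loop shape only:

```c
#include <stddef.h>

/* Original shape: the exit test sits in the loop header and runs
   before every iteration, including the first one.  */
long sum_while(const long *v, size_t n)
{
    long total = 0;
    size_t i = 0;

    while (i < n) {
        total += v[i];
        i++;
    }
    return total;
}

/* Shape after header copying: the test is duplicated once in front of
   the loop, and the loop itself becomes a do-while.  */
long sum_copied_header(const long *v, size_t n)
{
    long total = 0;
    size_t i = 0;

    if (i < n) {                /* copied header, guards the first iteration */
        do {
            total += v[i];
            i++;
        } while (i < n);        /* original test, now at the bottom of the loop */
    }
    return total;
}
```

Both functions compute the same result. The second form guarantees that the body runs at least once when the loop is entered, which is what lets later passes move code out of the loop more freely.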
This is also why adding the compiler barrier inhibits the optimization: it prevents -ftree-ch from reorganizing the inner loop, which is then passed as-is to the later optimization phases.
It is also a generic pass that affects all architectures, although the resulting stack usage depends on later passes. With GCC 9.2.1, -fstack-usage reports the following with -O2:
arch          stack usage (bytes)
arm           632
aarch64       448
powerpc       912
powerpc64le   560
s390          600
s390x         632
i386          1376
x86_64        784
Also, -fconserve-stack does not really help here, since the -ftree-ch pass does not check that flag. Currently -fconserve-stack only seems to affect the inliner, by setting both large-stack-frame and large-stack-frame-growth to conservative values.
The straightforward change I am checking is simply to disable the tree-ch optimization when -fconserve-stack is also enabled:
diff --git a/gcc/tree-ssa-loop-ch.c b/gcc/tree-ssa-loop-ch.c
index b894a7e0918..b14dd66257c 100644
--- a/gcc/tree-ssa-loop-ch.c
+++ b/gcc/tree-ssa-loop-ch.c
@@ -291,7 +291,8 @@ public:
   {}
 
   /* opt_pass methods: */
-  virtual bool gate (function *) { return flag_tree_ch != 0; }
+  virtual bool gate (function *) { return flag_tree_ch != 0
+                                          && flag_conserve_stack == 0; }
 
   /* Initialize and finalize loop structures, copying headers inbetween.  */
   virtual unsigned int execute (function *);
On powerpc64le with gcc master:
$ /home/azanella/gcc/gcc-git-build/gcc/xgcc -B /home/azanella/gcc/gcc-git-build/gcc -O2 ../stack_usage.c -c -fstack-usage && cat stack_usage.su
../stack_usage.c:157:6:mlx5e_grp_sw_update_stats 496 static

$ /home/azanella/gcc/gcc-git-build/gcc/xgcc -B /home/azanella/gcc/gcc-git-build/gcc -O2 ../stack_usage.c -c -fstack-usage -fconserve-stack && cat stack_usage.su
../stack_usage.c:157:6:mlx5e_grp_sw_update_stats 176 static

The reference for minimal stack usage is with -Os:

$ /home/azanella/gcc/gcc-git-build/gcc/xgcc -B /home/azanella/gcc/gcc-git-build/gcc -Os ../stack_usage.c -c -fstack-usage && cat stack_usage.su
../stack_usage.c:157:6:mlx5e_grp_sw_update_stats 32 static
I will also check whether applying the same test to -fgcse and -ftree-ter makes sense.
On Fri, Nov 22, 2019 at 2:40 PM Adhemerval Zanella adhemerval.zanella@linaro.org wrote:
> Hi Arnd,
> I took a look at the stack usage issue in the kernel snippet you provided [1], and as you noted, most of the impact does come from the -ftree-ch optimization. It is enabled at all optimization levels except -Os (since, besides possibly increasing stack usage, it might also increase code size).
> I am still trying to fully grasp what the -ftree-ch optimization does, but my understanding so far is that it tries to reorganize the loop for later loop optimization phases. More specifically, what it ends up doing on this snippet is creating extra stack variables for the internal member accesses in the inner loop (which in turn increases stack usage).
Thanks a lot for taking a detailed look!
> The straightforward change I am checking is just to disable tree-ch optimization if fconserve-stack is also enabled:
>
> diff --git a/gcc/tree-ssa-loop-ch.c b/gcc/tree-ssa-loop-ch.c
> index b894a7e0918..b14dd66257c 100644
> --- a/gcc/tree-ssa-loop-ch.c
> +++ b/gcc/tree-ssa-loop-ch.c
> @@ -291,7 +291,8 @@ public:
>    {}
>
>    /* opt_pass methods: */
> -  virtual bool gate (function *) { return flag_tree_ch != 0; }
> +  virtual bool gate (function *) { return flag_tree_ch != 0
> +                                          && flag_conserve_stack == 0; }
>
>    /* Initialize and finalize loop structures, copying headers inbetween.  */
>    virtual unsigned int execute (function *);
That assumes that ftree-ch generally results in higher stack usage, which is something we would have to confirm first. I've done similar checks before on other options, basically building a large project like the kernel with -Wframe-larger-than=128 (or similar), and then comparing the warning output with/without that flag.
That would tell us whether this is a systematic problem with -ftree-ch (making your patch a good idea) or whether the example code just hit a worst case that is otherwise rare, and turning off -ftree-ch generally just leads to worse output but no lower stack usage.
One suspicion I have is that this is related to not only having a large struct, but also having lots of 64-bit members in that struct and working on it on a 32-bit architecture.
Arnd
On 22/11/2019 10:55, Arnd Bergmann wrote:
> That assumes that ftree-ch generally results in higher stack usage, which is something we would have to confirm first. I've done similar checks before on other options, basically building a large project like the kernel with -Wframe-larger-than=128 (or similar), and then comparing the warning output with/without that flag.
> That would tell us whether this is a systematic problem with -ftree-ch (making your patch a good idea) or whether the example code just hit a worst case that is otherwise rare, and turning off -ftree-ch generally just leads to worse output but no lower stack usage.
Yes, it is a big hammer. I am trying to see whether I can get an estimate of the stack usage to check against param-large-stack-frame (set by -fconserve-stack), as gcc-git/gcc/ipa-inline.c does for the inliner.
The idea is to keep -ftree-ch enabled unless the transformation exceeds some stack usage threshold.
> One suspicion I have is that this is related to not only having a large struct, but also having lots of 64-bit members in that struct and working on it on a 32-bit architecture.
This most likely does increase the stack usage. Changing the u64 definition in the snippet to 'unsigned long', I see:

$ x86_64-glibc-linux-gnu-gcc -v -O2 -Wframe-larger-than=100 -Wa,--fatal-warnings stack_usage.c -fstack-usage -c -m32; cat stack_usage.su
stack_usage.c:158:6:mlx5e_grp_sw_update_stats 472 static

That is down from the previous 1376. However, with -ftree-ch disabled:

$ x86_64-glibc-linux-gnu-gcc -O2 -Wframe-larger-than=100 -Wa,--fatal-warnings stack_usage.c -fstack-usage -c -m32 -fno-tree-ch; cat stack_usage.su
stack_usage.c:158:6:mlx5e_grp_sw_update_stats 16 static
Actually, note that it is not -ftree-ch itself that causes the problem; rather, -ftree-ch allows other optimizations to do their work. E.g. I need to turn off all of the loop invariant code motion to get rid of the spilling: "-fno-tree-loop-im -fno-tree-pre -fno-move-loop-invariants -fno-gcse".
Also, the reason the memory barrier in the inline asm works is that it tells the invariant code motion optimizations that there is a memory barrier that can have an effect on the code. Basically, GCC does not have a good way to estimate register pressure for loop invariant code motion. It has some heuristics, but those are not always good.
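To make the barrier point concrete, here is a rough sketch of the pattern, assuming GCC or Clang inline asm syntax. The struct layouts and function names are hypothetical stand-ins, not the actual mlx5e code, and whether any spilling shows up depends on the target, the compiler version and the number of counters:

```c
#include <stdint.h>

/* Hypothetical stand-ins for the per-channel and summary counters. */
struct chan_stats { uint64_t packets, bytes, drops; };
struct sw_stats   { uint64_t packets, bytes, drops; };

/* Empty asm with a "memory" clobber: it emits no instructions, but the
   compiler must assume memory was read and written, so values loaded
   from memory cannot be kept in registers across it. */
#define barrier() __asm__ __volatile__("" ::: "memory")

/* Without the barrier the compiler is free to keep every accumulator in
   a register (or a stack slot, once registers run out) across the loop. */
void update_stats_plain(struct sw_stats *s, const struct chan_stats *c, int nch)
{
    s->packets = s->bytes = s->drops = 0;
    for (int i = 0; i < nch; i++) {
        s->packets += c[i].packets;
        s->bytes   += c[i].bytes;
        s->drops   += c[i].drops;
    }
}

/* With the barrier each iteration has to reload and store the
   accumulators through memory, which limits code motion out of the loop. */
void update_stats_barrier(struct sw_stats *s, const struct chan_stats *c, int nch)
{
    s->packets = s->bytes = s->drops = 0;
    for (int i = 0; i < nch; i++) {
        s->packets += c[i].packets;
        s->bytes   += c[i].bytes;
        s->drops   += c[i].drops;
        barrier();
    }
}
```

Both variants produce identical results; only the freedom given to the optimizer differs. Compiling each with -fstack-usage is one way to compare the frame sizes on a given target.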
Thanks, Andrew Pinski
________________________________________
From: linaro-toolchain linaro-toolchain-bounces@lists.linaro.org on behalf of Adhemerval Zanella adhemerval.zanella@linaro.org
Sent: Friday, November 22, 2019 5:40 AM
To: Arnd Bergmann
Cc: Linaro Toolchain Mailman List
Subject: [EXT] High stack usage due ftree-ch
[1] https://urldefense.proofpoint.com/v2/url?u=https-3A__godbolt.org_z_WKa-2DBd&...
_______________________________________________
linaro-toolchain mailing list
linaro-toolchain@lists.linaro.org
https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.linaro.org_mailma...
> It is enabled at all optimization levels except -Os (since, besides possibly increasing stack usage, it might also increase code size).
It is disabled at -Os because it duplicates the loop header, which is considered an increase in code size (though sometimes that has the side effect of decreasing the code size later on, but that is a different story). The increase in stack usage is due to register pressure from other optimizations that can now work with the copied loop header. If anything, the register pressure heuristics need improvement for the code motion passes, or register allocation needs the ability to undo that code motion. THIS IS a HUGE project and should not be taken lightly. It just so happens that this code triggers the issue; it is not really the normal case.
Thanks, Andrew
On 22/11/2019 11:38, Andrew Pinski wrote:
>> It is enabled at all optimization levels except -Os (since, besides possibly increasing stack usage, it might also increase code size).
> It is disabled at -Os because it duplicates the loop header, which is considered an increase in code size (though sometimes that has the side effect of decreasing the code size later on, but that is a different story). The increase in stack usage is due to register pressure from other optimizations that can now work with the copied loop header. If anything, the register pressure heuristics need improvement for the code motion passes, or register allocation needs the ability to undo that code motion. THIS IS a HUGE project and should not be taken lightly. It just so happens that this code triggers the issue; it is not really the normal case.
Thanks for the information. At least for this specific snippet, it seems that both -fno-tree-loop-im and -fno-tree-pre are the ones generating the most spilling.
So my question is whether it is worth disabling -ftree-ch when -fconserve-stack is set (since the idea of that flag is to prevent such pessimizations), or whether the idea is just to disable -ftree-ch for such cases.
> Thanks for the information. At least for this specific snippet, it seems that both -fno-tree-loop-im and -fno-tree-pre are the ones generating the most spilling.
That is because the code motion is happening at the RTL level: -fno-gcse is the one you are looking for.
I should say that you need all three options to prevent the code motion from happening: -fno-tree-loop-im -fno-tree-pre -fno-gcse
-fno-tree-ch prevents the code motion from happening too, but only by accident, in that all three of the code motion passes (the two on GIMPLE and the one on RTL) won't work with the loop in that form. Disabling the copy header optimization for flag_conserve_stack is the wrong approach. Again, you need to look into each of the code motion passes to understand the register pressure heuristics and why they do the code motion.
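For experimenting with the three options on a single function rather than a whole translation unit, one possibility is GCC's optimize function attribute. This is only a sketch for investigation: the macro and function names are hypothetical, the attribute is GCC-specific, and the GCC documentation describes it as a debugging aid rather than something for production code.

```c
#include <stdint.h>

#if defined(__GNUC__) && !defined(__clang__)
/* GCC only: each string is handled like a -f option, so "no-gcse"
   stands for -fno-gcse, and so on. */
#define NO_CODE_MOTION \
    __attribute__((optimize("no-tree-loop-im", "no-tree-pre", "no-gcse")))
#else
/* Other compilers: fall back to the default optimization settings. */
#define NO_CODE_MOTION
#endif

/* Hypothetical function to experiment on: sums n 64-bit counters. */
NO_CODE_MOTION
uint64_t sum_counters(const uint64_t *v, int n)
{
    uint64_t total = 0;

    for (int i = 0; i < n; i++)
        total += v[i];
    return total;
}
```

Building with -fstack-usage with and without the attribute then shows the effect on just that function, without changing how the rest of the file is compiled.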
Also I have not looked into why the RTL loop invariant code motion pass did NOTHING here.
Thanks, Andrew Pinski
On 22/11/2019 11:52, Andrew Pinski wrote:
> I should say that you need all three options to prevent the code motion from happening: -fno-tree-loop-im -fno-tree-pre -fno-gcse
> -fno-tree-ch prevents the code motion from happening too, but only by accident, in that all three of the code motion passes (the two on GIMPLE and the one on RTL) won't work with the loop in that form. Disabling the copy header optimization for flag_conserve_stack is the wrong approach. Again, you need to look into each of the code motion passes to understand the register pressure heuristics and why they do the code motion.
> Also I have not looked into why the RTL loop invariant code motion pass did NOTHING here.
> Thanks, Andrew Pinski
linaro-toolchain@lists.linaro.org