Hi All,
I am working on log10/qsort benchmarks on an ARM64 (ARMv8) processor, and I want to check whether we have experience with these benchmarks. I am looking for the compiler version and the specific compiler optimizations that give the best results with these benchmarks (in my case, I see that -O3 gives the best numbers).
I have tried GCC-4.9 and GCC-6.2 with log10 benchmark and my observations are:
1) With gcc 4.9 - 140 us
2) With GCC 6.2 - 150 us
My compilation flags are "-O3 -ftree-vectorize -funroll-all-loops --param max-inline-insns-auto=550 --param case-values-threshold=30 -falign-functions=32 -ftracer"
So it seems like gcc-6.2 is better. Am I missing something? Should I use some better compiler flags?
Thanks
-Bharat
On Tue, Feb 7, 2017 at 7:50 AM, Bharat Bhushan bharat.bhushan@linaro.org wrote:
I am working on log10/qsort benchmarks on an ARM64 (ARMv8) processor, and I want to check whether we have experience with these benchmarks.
We have experience with things like SPEC and Coremark, which are compiler performance benchmarks. log10/qsort sounds like glibc functions. Are you testing glibc performance? That would perhaps depend more on the glibc version than the compiler version.
I am looking for the compiler version and the specific compiler optimizations that give the best results with these benchmarks (in my case, I see that -O3 gives the best numbers).
What exactly are you trying to optimize? If you want the best performance for your application, then you try every compiler version and every option and use the combination that gives the best performance. We toolchain developers only care about the performance of the latest version, and if it isn't the best-performing one, then we try to fix it. If you want the best performance for the most people, then you concentrate on -O2 results, as that is what most people use. I can't give a better answer without more specifics of what exactly you are trying to do.
I have tried GCC-4.9 and GCC-6.2 with log10 benchmark and my observations are:
1) With gcc 4.9 - 140 us
2) With GCC 6.2 - 150 us
My compilation flags are "-O3 -ftree-vectorize -funroll-all-loops --param max-inline-insns-auto=550 --param case-values-threshold=30 -falign-functions=32 -ftracer"
So it seems like gcc-6.2 is better. Am I missing something? Should I use some better compiler flags?
Usually for benchmarks, a faster runtime is a better result, so it looks like gcc-4.9 is giving the better result. If that is a gcc-6 bug, then it should be reported so we can try to fix it. However, you are using a lot of options, and some of those options aren't the default because they don't always give the best results. The usefulness of some uncommon optimization options can vary from one gcc release to the next. You may need to use different sets of gcc options with different gcc versions to get the best results. But again, as mentioned above, this all depends on what exactly you are trying to do, and you haven't given us enough info to understand that.
Jim
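As a rough illustration of the kind of measurement being discussed, a minimal timing sketch in C might look like the one below. It is not the benchmark used in this thread. The helper names (time_unary_us, baseline_identity, slowdown_percent, print_toolchain) are hypothetical, and the input sequence 1.0, 2.0, ... is an assumption made only for illustration. Printing both the compiler and the glibc version is included because the result may depend on either.

```c
#define _POSIX_C_SOURCE 200809L  /* for clock_gettime under strict -std= modes */
#include <assert.h>
#include <stdio.h>
#include <time.h>
#ifdef __GLIBC__
#include <gnu/libc-version.h>    /* glibc-only: gnu_get_libc_version() */
#endif

/* Hypothetical baseline: measures loop and call overhead only. */
double baseline_identity(double x) { return x; }

/* Time n calls of fn on the inputs 1.0, 2.0, ...; returns elapsed microseconds.
   The sum of the results is stored in *sum_out so the calls stay observable. */
double time_unary_us(double (*fn)(double), long n, double *sum_out)
{
    struct timespec t0, t1;
    volatile double start = 1.0;   /* volatile: value not known at compile time */
    double x = start, sum = 0.0;

    assert(fn != NULL && n > 0 && sum_out != NULL);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < n; i++) {
        sum += fn(x);
        x += 1.0;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    *sum_out = sum;
    return (double)(t1.tv_sec - t0.tv_sec) * 1e6
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e3;
}

/* Percentage by which new_us is slower than old_us (negative means faster). */
double slowdown_percent(double old_us, double new_us)
{
    return (new_us - old_us) / old_us * 100.0;
}

/* Report the compiler and, when available, the glibc version in use. */
void print_toolchain(void)
{
    printf("compiler: %s\n", __VERSION__);
#ifdef __GLIBC__
    printf("glibc:    %s\n", gnu_get_libc_version());
#endif
}
```

To time the glibc function, pass log10 from <math.h> as fn and link with -lm; subtracting the baseline_identity time gives a rough estimate of the library call itself. With the figures quoted above, slowdown_percent(140.0, 150.0) is about 7.1, that is, the GCC 6.2 build is roughly 7% slower.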
Thanks Jim,
When I use "-mtune" and/or "-mcpu" with GCC 6.2, I see almost the same number as with GCC 4.9.
Thanks -Bharat
On 07/02/2017 13:50, Bharat Bhushan wrote:
Hi All,
I am working on log10/qsort benchmarks on an ARM64 (ARMv8) processor, and I want to check whether we have experience with these benchmarks. I am looking for the compiler version and the specific compiler optimizations that give the best results with these benchmarks (in my case, I see that -O3 gives the best numbers).
I have tried GCC-4.9 and GCC-6.2 with log10 benchmark and my observations are:
1) With gcc 4.9 - 140 us
2) With GCC 6.2 - 150 us
My compilation flags are "-O3 -ftree-vectorize -funroll-all-loops --param max-inline-insns-auto=550 --param case-values-threshold=30 -falign-functions=32 -ftracer"
So it seems like gcc-6.2 is better. Am I missing something? Should I use some better compiler flags?
It is really hard to give you any advice without the actual code to check what exactly you are measuring. Are you using a custom log10 implementation or the glibc one?
The compiler options look like what one would expect for a mathematical workload. However, I would profile and check whether '-funroll-all-loops' and '--param max-inline-insns-auto=550 --param case-values-threshold=30' are actually helping in this case. All of them tend to increase code size, which may or may not add icache pressure; it really depends on the workload and dataflow.
In any case, it would be good to profile the code to check exactly where the hotspot is and, based on the code and its characteristics, check whether any other flags or other kinds of optimization (PGO, IPA) can help you out.
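For the qsort side, one way to see what such a benchmark is actually exercising is to count how often the sort calls back into user code. The sketch below is only an illustration under assumed names (cmp_int, fill_lcg, sort_and_count, is_sorted are hypothetical) and an assumed input (a simple deterministic pseudo-random sequence). With a dynamically linked glibc, qsort itself is typically already compiled inside the C library, so the flags used to build the benchmark would mostly affect the comparator and the driver code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Counts the callbacks that qsort makes into user-compiled code. */
static unsigned long cmp_calls;

/* Comparator for ints; this is the part built with the benchmark's flags. */
static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a;
    int y = *(const int *)b;
    cmp_calls++;
    return (x > y) - (x < y);   /* avoids overflow from x - y */
}

/* Fill v with a deterministic pseudo-random sequence (simple LCG). */
void fill_lcg(int *v, size_t n, unsigned seed)
{
    for (size_t i = 0; i < n; i++) {
        seed = seed * 1103515245u + 12345u;
        v[i] = (int)((seed >> 16) & 0x7fff);
    }
}

/* Sort v in place with the C library's qsort; return the comparator call count. */
unsigned long sort_and_count(int *v, size_t n)
{
    assert(v != NULL);
    cmp_calls = 0;
    qsort(v, n, sizeof v[0], cmp_int);
    return cmp_calls;
}

/* Return 1 if v is in non-decreasing order, 0 otherwise. */
int is_sorted(const int *v, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        if (v[i - 1] > v[i])
            return 0;
    }
    return 1;
}
```

Calling fill_lcg on an array and then sort_and_count gives the number of comparator invocations for that input. A profiler run on the same program would then show how the time splits between the library's sorting code and the comparator.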
linaro-toolchain@lists.linaro.org