Re: [Linaro-TCWG-CI] gcc patch #93154: FAIL: 1 regressions on arm

6 Jul 2024


Hello Paul,
Paul Richard Thomas paul.richard.thomas@gmail.com writes:
...
Hi There,
I have been withholding the commit of this patch until I hear from you.
Sorry for the late response. I don't know much about Fortran or
gfortran, but I tried to have a look at the failure. More details below,
but unfortunately I didn't find anything concrete. Hopefully the Valgrind
reports can help.
Please let me know if there are other tests or investigation I can make.
...
On Tue, 2 Jul 2024 at 08:48, Paul Richard Thomas <
paul.richard.thomas@gmail.com> wrote:
...
Hi there,
You detected a failure in gfortran.dg/class_transformational_2.f90:
PASS: gfortran.dg/class_transformational_2.f90   -O0  (test for excess
errors)
PASS: gfortran.dg/class_transformational_2.f90   -O0  execution test
PASS: gfortran.dg/class_transformational_2.f90   -O1  (test for excess
errors)
FAIL: gfortran.dg/class_transformational_2.f90   -O1  execution test
PASS: gfortran.dg/class_transformational_2.f90   -O2  (test for excess
errors)
PASS: gfortran.dg/class_transformational_2.f90   -O2  execution test
PASS: gfortran.dg/class_transformational_2.f90   -O3 -fomit-frame-pointer
...snip...
PASS: gfortran.dg/class_transformational_2.f90   -O3 -fomit-frame-pointer
...snip...
PASS: gfortran.dg/class_transformational_2.f90   -O3 -g  (test for excess
errors)
PASS: gfortran.dg/class_transformational_2.f90   -O3 -g  execution test
PASS: gfortran.dg/class_transformational_2.f90   -Os  (test for excess
errors)
PASS: gfortran.dg/class_transformational_2.f90   -Os  execution test
The stop message in the full log indicates a numeric error in the first
test. I am unable to reproduce the error. Adding deallocation of all the
allocated variables (which I should have done in the first place) and
running valgrind with -s shows no errors and no memory loss.
I find it odd that it should fail once at -O1 and not at -O2 and higher.
Can you provide me with any insights, e.g. by rerunning the testcase outside
of the dejagnu framework?
I can see the problem reliably when running the testcase binary for -O1
on an armv8l-linux-gnueabihf machine. Here's a GDB session showing where
it abruptly exits:
$ gdb -q class_transformational_2.exe
Reading symbols from class_transformational_2.exe...
(gdb) break check_spread
Breakpoint 1 at 0x10c72: file /home/thiago.bauermann/src/gcc/gcc/testsuite/gfortran.dg/class_transformational_2.f90, line 54.
(gdb) r
Starting program: /home/thiago.bauermann/.cache/builds/gcc-native-aarch32/gcc/testsuite/gfortran/class_transformational_2.exe
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".
Breakpoint 1, check_spread () at /home/thiago.bauermann/src/gcc/gcc/testsuite/gfortran.dg/class_transformational_2.f90:54
54          stop_flag = 10
(gdb) n
55          a = [(s(j,10*j), j = 1,2)]
(gdb)
56          b = spread (a, dim = 2, ncopies = 2)
(gdb)
57          c = spread (b, dim = 1, ncopies = 4)
(gdb)
58          a = reshape (c, [size (c)])
(gdb) p c
$1 = ( i = (0, 1072693248), x = 1, d = 1 )
(gdb) n
STOP 12
[Inferior 1 (process 3684330) exited with code 014]
(gdb)
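A side note on the exit status, in case it looks odd: as far as I know GDB
prints the inferior's exit code in octal, so 014 is decimal 12. That matches
the STOP 12 message, given that line 54 sets stop_flag to 10 and line 48
stops with stop_flag + 2. A quick Python check of that arithmetic (the
variable names are just mine):

```python
# GDB reports "exited with code 014"; the leading zero marks an octal value.
gdb_exit_code = int("014", 8)

# Values taken from the test source lines visible in the GDB sessions:
# line 54 sets stop_flag = 10, line 48 executes "stop stop_flag + 2".
stop_flag = 10
expected_stop = stop_flag + 2

print(gdb_exit_code, expected_stop)
```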
If I step into reshape, things seem to work fine, all the way to
_gfortrani_reshape_packed. If I then type "next" after the last
statement in that function, the process ends:
_gfortrani_reshape_packed (ret=0x252e0 "", rsize=128, source=0x25258 "\001\001\001\001", ssize=128, pad=0x0, psize=8) at /home/thiago.bauermann/src/gcc/libgfortran/intrinsics/reshape_packed.c:38
38        size = (rsize > ssize) ? ssize : rsize;
(gdb) n
39        memcpy (ret, source, size);
(gdb) n
42        while (rsize > 0)
(gdb) n
STOP 12
[Inferior 1 (process 3739928) exited with code 014]
(gdb)
If instead of typing "next", I type "step", then GDB enters realloc, and
some "MAIN__::__copy_MAIN___S" thing before moving to the next
line. Then it actually leaves the line with the reshape call and
proceeds further! It ends up exiting within check_result, line 48:
⋮
48              if (any (a%i .ne. ii)) stop stop_flag + 2
(gdb)
STOP 12
[Inferior 1 (process 3739974) exited with code 014]
(gdb)
So this seems to be a heisenbug, where the program behaves differently
in the presence of a debugger...
Just some baseless speculation: maybe the realloc call is failing? And
for some unknown reason, when doing the single-stepping in GDB it
succeeds? I can't think of anything else at least so far.
For comparison, here are sessions on a binary built with -O0:
$ gdb -q class_transformational_2-O0.exe
Reading symbols from class_transformational_2-O0.exe...
(gdb) break check_spread
Breakpoint 1 at 0x136e2: file /home/thiago.bauermann/src/gcc/gcc/testsuite/gfortran.dg/class_transformational_2.f90, line 54.
(gdb) r
Starting program: /home/thiago.bauermann/.cache/builds/gcc-native-aarch32/gcc/testsuite/gfortran/class_transformational_2-O0.exe
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".
Breakpoint 1, MAIN__::check_spread () at /home/thiago.bauermann/src/gcc/gcc/testsuite/gfortran.dg/class_transformational_2.f90:54
54          stop_flag = 10
(gdb) n
55          a = [(s(j,10*j), j = 1,2)]
(gdb)
56          b = spread (a, dim = 2, ncopies = 2)
(gdb)
57          c = spread (b, dim = 1, ncopies = 4)
(gdb)
58          a = reshape (c, [size (c)])
(gdb) p c
$1 = ( _data = (((( i = 1 ), ( i = 1 ), ( i = 1 ), ( i = 1 )) (( i = 2 ), ( i = 2 ), ( i = 2 ), ( i = 2 ))) ((( i = 1 ), ( i = 1 ), ( i = 1 ), ( i = 1 )) (( i = 2 ), ( i = 2 ), ( i = 2 ), ( i = 2 )))), _vptr = 0x26174 <__vtab_MAIN___S.22> )
(gdb) n
59          ishape = [4,2,2]
(gdb) p a
$2 = ( _data = (( i = 1 ), ( i = 1 ), ( i = 1 ), ( i = 1 ), ( i = 2 ), ( i = 2 ), ( i = 2 ), ( i = 2 ), ( i = 1 ), ( i = 1 ), ( i = 1 ), ( i = 1 ), ( i = 2 ), ( i = 2 ), ( i = 2 ), ( i = 2 )), _vptr = 0x26174 <__vtab_MAIN___S.22> )
(gdb)
Note that 'c' is very different than in the -O1 case. Is that expected?
Now with a binary built with -O2:
$ gdb -q class_transformational_2-O2.exe
Reading symbols from class_transformational_2-O2.exe...
(gdb) start
Temporary breakpoint 1 at 0x10704: file /home/thiago.bauermann/src/gcc/gcc/testsuite/gfortran.dg/class_transformational_2.f90, line 15.
Starting program: /home/thiago.bauermann/.cache/builds/gcc-native-aarch32/gcc/testsuite/gfortran/class_transformational_2-O2.exe
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".
Temporary breakpoint 1, MAIN__ () at /home/thiago.bauermann/src/gcc/gcc/testsuite/gfortran.dg/class_transformational_2.f90:15
15        class(t), allocatable :: scalar, a(:), aa(:), b(:,:), c(:,:,:), field(:,:,:)
(gdb) break 58
Breakpoint 2 at 0x10b34: file /home/thiago.bauermann/src/gcc/gcc/testsuite/gfortran.dg/class_transformational_2.f90, line 58.
(gdb) c
Continuing.
Breakpoint 2, check_spread () at /home/thiago.bauermann/src/gcc/gcc/testsuite/gfortran.dg/class_transformational_2.f90:58
58          a = reshape (c, [size (c)])
(gdb) p c
$1 = ( i = (0, 1072693248), x = 1, d = 1 )
(gdb) n
59          ishape = [4,2,2]
(gdb) p a
$2 = ( i = (0, 1045149306), x = 1.2904777690891933e-08, d = 1.2904777690891933e-08 )
(gdb)
Here, 'c' is the same as in the -O1 case.
If I run the "continue" GDB command, then the program completes
successfully. I wasn't able to break on check_spread this time because
that function isn't present in the optimized binary.
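One observation on the "p c" and "p a" output in the -O1 and -O2 sessions:
the integer pairs look like the two 32-bit halves of the doubles printed next
to them, so GDB may be showing the same 8 bytes through several overlaid
types rather than the contents of the class array. That reading is an
assumption on my part; I did not confirm it in the debug info. The sketch
below (Python, with a helper name I made up) reinterprets the pairs as
little-endian doubles:

```python
import struct

def words_to_double(low, high):
    """Reinterpret two 32-bit words (low word first) as one little-endian double."""
    return struct.unpack("<d", struct.pack("<II", low, high))[0]

# Pair printed for 'c' in the -O1 and -O2 sessions: i = (0, 1072693248)
print(words_to_double(0, 1072693248))
# Pair printed for 'a' in the -O2 session: i = (0, 1045149306)
print(words_to_double(0, 1045149306))
```

The first pair decodes to 1.0 and the second to roughly 1.29e-08, the same
values GDB shows for x and d in those sessions.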
Another thing that I noticed is that the test occasionally fails on
aarch64-linux, about 1 in 50 times when I run it repeatedly in a
loop. This happens with the "-O1", "-O2", "-O3" and "-O3
-fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer
-finline-functions" variations. But not with the "-O0" variation.
Because the failure is intermittent, I haven't yet been able to run a
debugger when it happens. I'll try again next week with some scripting.
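For the scripting, something along these lines is what I have in mind: run
the test binary repeatedly and tally the exit statuses, optionally stopping
at the first non-zero one. This is only a sketch; the function names and the
stop_on_failure option are placeholders of mine, not part of the testsuite.

```python
import collections
import subprocess
import sys

def count_exit_codes(cmd, runs, stop_on_failure=False):
    """Run cmd up to `runs` times and return a Counter of exit statuses."""
    counts = collections.Counter()
    for _ in range(runs):
        # Discard output so that only the exit status is recorded.
        result = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
        counts[result.returncode] += 1
        if stop_on_failure and result.returncode != 0:
            break
    return counts

def report(counts, out=sys.stdout):
    """Print one line per distinct exit status, lowest status first."""
    for code, n in sorted(counts.items()):
        print(f"exit status {code}: {n} run(s)", file=out)
```

It could be used as report(count_exit_codes(["./class_transformational_2.exe"], 500)).
A negative status from subprocess means the process was killed by a signal,
which would distinguish a crash from the STOP 12 exit.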
I tried reproducing on x86_64-linux, but couldn't.
I'm attaching the valgrind reports for arm and aarch64.
-- 
Thiago
