Hello Paul,
Paul Richard Thomas <paul.richard.thomas@gmail.com> writes:

> Hi There,
>
> I have been withholding the commit of this patch until I hear from you.
Sorry for the late response. I don't know much about Fortran or gfortran, but I tried to have a look at the failure. More details below, but unfortunately I didn't find anything concrete. Hopefully the Valgrind reports can help.
Please let me know if there are other tests or investigation I can make.
> On Tue, 2 Jul 2024 at 08:48, Paul Richard Thomas <paul.richard.thomas@gmail.com> wrote:
>
>> Hi there,
>>
>> You detected a failure in gfortran.dg/class_transformational_2.f90:
>> PASS: gfortran.dg/class_transformational_2.f90 -O0 (test for excess errors)
>> PASS: gfortran.dg/class_transformational_2.f90 -O0 execution test
>> PASS: gfortran.dg/class_transformational_2.f90 -O1 (test for excess errors)
>> FAIL: gfortran.dg/class_transformational_2.f90 -O1 execution test
>> PASS: gfortran.dg/class_transformational_2.f90 -O2 (test for excess errors)
>> PASS: gfortran.dg/class_transformational_2.f90 -O2 execution test
>> PASS: gfortran.dg/class_transformational_2.f90 -O3 -fomit-frame-pointer ...snip...
>> PASS: gfortran.dg/class_transformational_2.f90 -O3 -fomit-frame-pointer ...snip...
>> PASS: gfortran.dg/class_transformational_2.f90 -O3 -g (test for excess errors)
>> PASS: gfortran.dg/class_transformational_2.f90 -O3 -g execution test
>> PASS: gfortran.dg/class_transformational_2.f90 -Os (test for excess errors)
>> PASS: gfortran.dg/class_transformational_2.f90 -Os execution test
>>
>> The stop message in the full log indicates a numeric error in the first test. I am unable to reproduce the error. Adding deallocation of all the allocated variables (which I should have done in the first place) and running valgrind with -s shows no errors and no memory loss.
>>
>> I find it odd that it should fail once at -O1 and not at -O2 and higher. Can you provide me with any insights; eg, by rerunning the testcase outside of the dejagnu framework?
I can see the problem reliably when running the testcase binary for -O1 on an armv8l-linux-gnueabihf machine. Here's a GDB session showing where it abruptly exits:
$ gdb -q class_transformational_2.exe
Reading symbols from class_transformational_2.exe...
(gdb) break check_spread
Breakpoint 1 at 0x10c72: file /home/thiago.bauermann/src/gcc/gcc/testsuite/gfortran.dg/class_transformational_2.f90, line 54.
(gdb) r
Starting program: /home/thiago.bauermann/.cache/builds/gcc-native-aarch32/gcc/testsuite/gfortran/class_transformational_2.exe
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".

Breakpoint 1, check_spread () at /home/thiago.bauermann/src/gcc/gcc/testsuite/gfortran.dg/class_transformational_2.f90:54
54    stop_flag = 10
(gdb) n
55    a = [(s(j,10*j), j = 1,2)]
(gdb)
56    b = spread (a, dim = 2, ncopies = 2)
(gdb)
57    c = spread (b, dim = 1, ncopies = 4)
(gdb)
58    a = reshape (c, [size (c)])
(gdb) p c
$1 = ( i = (0, 1072693248), x = 1, d = 1 )
(gdb) n
STOP 12
[Inferior 1 (process 3684330) exited with code 014]
(gdb)
If I step into reshape, things seem to work fine, all the way to _gfortrani_reshape_packed. If I then type "next" after the last statement in that function, the process ends:
_gfortrani_reshape_packed (ret=0x252e0 "", rsize=128, source=0x25258 "\001\001\001\001", ssize=128, pad=0x0, psize=8)
    at /home/thiago.bauermann/src/gcc/libgfortran/intrinsics/reshape_packed.c:38
38    size = (rsize > ssize) ? ssize : rsize;
(gdb) n
39    memcpy (ret, source, size);
(gdb) n
42    while (rsize > 0)
(gdb) n
STOP 12
[Inferior 1 (process 3739928) exited with code 014]
(gdb)
If instead of typing "next", I type "step", then GDB enters realloc, and some "MAIN__::__copy_MAIN___S" thing before moving to the next line. Then it actually leaves the line with the reshape call and proceeds further! It ends up exiting within check_result, line 48:
⋮
48    if (any (a%i .ne. ii)) stop stop_flag + 2
(gdb)
STOP 12
[Inferior 1 (process 3739974) exited with code 014]
(gdb)
So this seems to be a heisenbug, where the program behaves differently in the presence of a debugger...
Just some baseless speculation: maybe the realloc call is failing? And for some unknown reason, when doing the single-stepping in GDB it succeeds? I can't think of anything else at least so far.
For comparison, here are sessions on a binary built with -O0:
$ gdb -q class_transformational_2-O0.exe
Reading symbols from class_transformational_2-O0.exe...
(gdb) break check_spread
Breakpoint 1 at 0x136e2: file /home/thiago.bauermann/src/gcc/gcc/testsuite/gfortran.dg/class_transformational_2.f90, line 54.
(gdb) r
Starting program: /home/thiago.bauermann/.cache/builds/gcc-native-aarch32/gcc/testsuite/gfortran/class_transformational_2-O0.exe
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".

Breakpoint 1, MAIN__::check_spread () at /home/thiago.bauermann/src/gcc/gcc/testsuite/gfortran.dg/class_transformational_2.f90:54
54    stop_flag = 10
(gdb) n
55    a = [(s(j,10*j), j = 1,2)]
(gdb)
56    b = spread (a, dim = 2, ncopies = 2)
(gdb)
57    c = spread (b, dim = 1, ncopies = 4)
(gdb)
58    a = reshape (c, [size (c)])
(gdb) p c
$1 = ( _data = (((( i = 1 ), ( i = 1 ), ( i = 1 ), ( i = 1 )) (( i = 2 ), ( i = 2 ), ( i = 2 ), ( i = 2 ))) ((( i = 1 ), ( i = 1 ), ( i = 1 ), ( i = 1 )) (( i = 2 ), ( i = 2 ), ( i = 2 ), ( i = 2 )))), _vptr = 0x26174 <__vtab_MAIN___S.22> )
(gdb) n
59    ishape = [4,2,2]
(gdb) p a
$2 = ( _data = (( i = 1 ), ( i = 1 ), ( i = 1 ), ( i = 1 ), ( i = 2 ), ( i = 2 ), ( i = 2 ), ( i = 2 ), ( i = 1 ), ( i = 1 ), ( i = 1 ), ( i = 1 ), ( i = 2 ), ( i = 2 ), ( i = 2 ), ( i = 2 )), _vptr = 0x26174 <__vtab_MAIN___S.22> )
(gdb)
Note that 'c' looks very different here from the -O1 case. Is that expected?
Now with a binary built with -O2:
$ gdb -q class_transformational_2-O2.exe
Reading symbols from class_transformational_2-O2.exe...
(gdb) start
Temporary breakpoint 1 at 0x10704: file /home/thiago.bauermann/src/gcc/gcc/testsuite/gfortran.dg/class_transformational_2.f90, line 15.
Starting program: /home/thiago.bauermann/.cache/builds/gcc-native-aarch32/gcc/testsuite/gfortran/class_transformational_2-O2.exe
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".

Temporary breakpoint 1, MAIN__ () at /home/thiago.bauermann/src/gcc/gcc/testsuite/gfortran.dg/class_transformational_2.f90:15
15    class(t), allocatable :: scalar, a(:), aa(:), b(:,:), c(:,:,:), field(:,:,:)
(gdb) break 58
Breakpoint 2 at 0x10b34: file /home/thiago.bauermann/src/gcc/gcc/testsuite/gfortran.dg/class_transformational_2.f90, line 58.
(gdb) c
Continuing.

Breakpoint 2, check_spread () at /home/thiago.bauermann/src/gcc/gcc/testsuite/gfortran.dg/class_transformational_2.f90:58
58    a = reshape (c, [size (c)])
(gdb) p c
$1 = ( i = (0, 1072693248), x = 1, d = 1 )
(gdb) n
59    ishape = [4,2,2]
(gdb) p a
$2 = ( i = (0, 1045149306), x = 1.2904777690891933e-08, d = 1.2904777690891933e-08 )
(gdb)
Here, 'c' is the same as in the -O1 case.
If I run the "continue" GDB command, then the program completes successfully. I wasn't able to break on check_spread this time because that function isn't present in the optimized binary.
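One observation that may help when reading the 'p c' and 'p a' output in the optimized builds: the integer pair shown for 'i' and the value shown for 'x' look like the same 8 bytes interpreted two ways. The Python sketch below reinterprets the two 32-bit words as one IEEE 754 double, assuming little-endian layout (low word first). The helper name words_to_double is made up for illustration. Whether GDB is simply displaying the data through an unexpected type in the optimized builds is a guess on my part, not something I verified.

```python
import struct


def words_to_double(low, high):
    """Reinterpret two 32-bit words (assumed little-endian, low word first)
    as a single IEEE 754 double. Hypothetical helper for illustration."""
    raw = struct.pack("<ii", low, high)
    return struct.unpack("<d", raw)[0]


# Pair printed for 'i' when inspecting 'c' in the -O1 and -O2 sessions.
print(words_to_double(0, 1072693248))  # 1.0, same as the 'x = 1' GDB shows

# Pair printed for 'i' when inspecting 'a' in the -O2 session;
# compare the result with the 'x' value GDB printed alongside it.
print(words_to_double(0, 1045149306))
```

If that reading is right, the odd-looking 'i' values are not necessarily a sign of corruption by themselves.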
Another thing that I noticed is that the test occasionally fails on aarch64-linux, about 1 in 50 times when I run it repeatedly in a loop. This happens with the "-O1", "-O2", "-O3" and "-O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions" variations, but not with the "-O0" variation.
Because the failure is intermittent, I haven't yet been able to catch it in a debugger. I'll try again next week with some scripting.
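For anyone who wants to measure the failure rate themselves, here is a minimal sketch of such a loop in Python. It is illustrative only and not the exact script used for the numbers above. The function name tally_exit_codes and the binary path in the usage comment are placeholders.

```python
import subprocess
from collections import Counter


def tally_exit_codes(cmd, runs):
    """Run cmd `runs` times and count how often each exit status occurs."""
    counts = Counter()
    for _ in range(runs):
        # Discard the program's output; only the exit status matters here.
        result = subprocess.run(cmd,
                                stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
        counts[result.returncode] += 1
    return counts


# Usage (placeholder path; substitute the real test binary):
#   counts = tally_exit_codes(["./class_transformational_2.exe"], 200)
#   for status, n in sorted(counts.items()):
#       print(f"exit status {status}: {n} run(s)")
```

A passing run shows up as exit status 0, and the "STOP 12" failure shows up as exit status 12 (GDB's "exited with code 014" is the same value in octal). A negative status from subprocess.run means the process was killed by a signal.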
I tried reproducing on x86_64-linux, but couldn't.
I'm attaching the valgrind reports for arm and aarch64.