[Linaro-TCWG-CI] gcc patch #93154: FAIL: 1 regressions on arm


Paul Richard Thomas

2 Jul 2024

7:48 a.m.

Hi there,

You detected a failure in gfortran.dg/class_transformational_2.f90:

PASS: gfortran.dg/class_transformational_2.f90 -O0 (test for excess errors)
PASS: gfortran.dg/class_transformational_2.f90 -O0 execution test
PASS: gfortran.dg/class_transformational_2.f90 -O1 (test for excess errors)
FAIL: gfortran.dg/class_transformational_2.f90 -O1 execution test
PASS: gfortran.dg/class_transformational_2.f90 -O2 (test for excess errors)
PASS: gfortran.dg/class_transformational_2.f90 -O2 execution test
PASS: gfortran.dg/class_transformational_2.f90 -O3 -fomit-frame-pointer ...snip...
PASS: gfortran.dg/class_transformational_2.f90 -O3 -fomit-frame-pointer ...snip...
PASS: gfortran.dg/class_transformational_2.f90 -O3 -g (test for excess errors)
PASS: gfortran.dg/class_transformational_2.f90 -O3 -g execution test
PASS: gfortran.dg/class_transformational_2.f90 -Os (test for excess errors)
PASS: gfortran.dg/class_transformational_2.f90 -Os execution test

The stop message in the full log indicates a numeric error in the first test. I am unable to reproduce the error. Adding deallocation of all the allocated variables (which I should have done in the first place) and running valgrind with -s shows no errors and no memory loss.

I find it odd that it should fail once at -O1 and not at -O2 and higher. Can you provide me with any insights, e.g. by rerunning the testcase outside of the DejaGnu framework?

Thank you for doing this testing, by the way, even if the failure is a bit obscure at the moment.

Best regards

Paul


Paul Richard Thomas

5 Jul

3:18 p.m.

Hi There,

I have been withholding the commit of this patch until I hear from you.

Regards

Paul


Thiago Jung Bauermann

6 Jul

5:26 a.m.

Hello Paul,

Paul Richard Thomas paul.richard.thomas@gmail.com writes:

...

Hi There,

I have been withholding the commit of this patch until I hear from you.

Sorry for the late response. I don't know much about Fortran or gfortran, but I tried to have a look at the failure. More details below, but unfortunately I didn't find anything concrete. Hopefully the Valgrind reports can help.

Please let me know if there are other tests or investigation I can make.

...

On Tue, 2 Jul 2024 at 08:48, Paul Richard Thomas < paul.richard.thomas@gmail.com> wrote:

I find it odd that it should fail once at -O1 and not at -O2 and higher. Can you provide me with any insights; eg, by rerunning the testcase outside of the dejagnu framework?

I can see the problem reliably when running the testcase binary for -O1 on an armv8l-linux-gnueabihf machine. Here's a GDB session showing where it abruptly exits:

$ gdb -q class_transformational_2.exe
Reading symbols from class_transformational_2.exe...
(gdb) break check_spread
Breakpoint 1 at 0x10c72: file /home/thiago.bauermann/src/gcc/gcc/testsuite/gfortran.dg/class_transformational_2.f90, line 54.
(gdb) r
Starting program: /home/thiago.bauermann/.cache/builds/gcc-native-aarch32/gcc/testsuite/gfortran/class_transformational_2.exe
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".

Breakpoint 1, check_spread () at /home/thiago.bauermann/src/gcc/gcc/testsuite/gfortran.dg/class_transformational_2.f90:54
54        stop_flag = 10
(gdb) n
55        a = [(s(j,10*j), j = 1,2)]
(gdb)
56        b = spread (a, dim = 2, ncopies = 2)
(gdb)
57        c = spread (b, dim = 1, ncopies = 4)
(gdb)
58        a = reshape (c, [size (c)])
(gdb) p c
$1 = ( i = (0, 1072693248), x = 1, d = 1 )
(gdb) n
STOP 12
[Inferior 1 (process 3684330) exited with code 014]
(gdb)

If I step into reshape, things seem to work fine, all the way to _gfortrani_reshape_packed. If I then type "next" after the last statement in that function, the process ends:

_gfortrani_reshape_packed (ret=0x252e0 "", rsize=128, source=0x25258 "\001\001\001\001", ssize=128, pad=0x0, psize=8) at /home/thiago.bauermann/src/gcc/libgfortran/intrinsics/reshape_packed.c:38
38        size = (rsize > ssize) ? ssize : rsize;
(gdb) n
39        memcpy (ret, source, size);
(gdb) n
42        while (rsize > 0)
(gdb) n
STOP 12
[Inferior 1 (process 3739928) exited with code 014]
(gdb)

If instead of typing "next", I type "step", then GDB enters realloc, and some "MAIN__::__copy_MAIN___S" thing before moving to the next line. Then it actually leaves the line with the reshape call and proceeds further! It ends up exiting within check_result, line 48:

⋮
48        if (any (a%i .ne. ii)) stop stop_flag + 2
(gdb)
STOP 12
[Inferior 1 (process 3739974) exited with code 014]
(gdb)
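
A note on reading the two numbers in these sessions: GDB prints the inferior's exit status in octal, so "exited with code 014" and "STOP 12" are the same value, and 12 is consistent with "stop stop_flag + 2" on line 48 after line 54 set stop_flag = 10. A minimal Python sketch of that arithmetic follows; the variable names are illustrative and not taken from the testcase.

```python
# GDB reports the exit status in octal: "exited with code 014".
gdb_exit_code = int("014", 8)

# Line 54 sets stop_flag = 10; line 48 executes "stop stop_flag + 2".
stop_flag = 10
expected_stop = stop_flag + 2

# Both values are the same number, once the octal is decoded.
print(gdb_exit_code, expected_stop)
```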

So this seems to be a heisenbug, where the program behaves differently in the presence of a debugger...

Just some baseless speculation: maybe the realloc call is failing? And for some unknown reason, when doing the single-stepping in GDB it succeeds? I can't think of anything else at least so far.

For comparison, here are sessions on a binary built with -O0:

$ gdb -q class_transformational_2-O0.exe
Reading symbols from class_transformational_2-O0.exe...
(gdb) break check_spread
Breakpoint 1 at 0x136e2: file /home/thiago.bauermann/src/gcc/gcc/testsuite/gfortran.dg/class_transformational_2.f90, line 54.
(gdb) r
Starting program: /home/thiago.bauermann/.cache/builds/gcc-native-aarch32/gcc/testsuite/gfortran/class_transformational_2-O0.exe
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".

Breakpoint 1, MAIN__::check_spread () at /home/thiago.bauermann/src/gcc/gcc/testsuite/gfortran.dg/class_transformational_2.f90:54
54        stop_flag = 10
(gdb) n
55        a = [(s(j,10*j), j = 1,2)]
(gdb)
56        b = spread (a, dim = 2, ncopies = 2)
(gdb)
57        c = spread (b, dim = 1, ncopies = 4)
(gdb)
58        a = reshape (c, [size (c)])
(gdb) p c
$1 = ( _data = (((( i = 1 ), ( i = 1 ), ( i = 1 ), ( i = 1 )) (( i = 2 ), ( i = 2 ), ( i = 2 ), ( i = 2 ))) ((( i = 1 ), ( i = 1 ), ( i = 1 ), ( i = 1 )) (( i = 2 ), ( i = 2 ), ( i = 2 ), ( i = 2 )))), _vptr = 0x26174 <__vtab_MAIN___S.22> )
(gdb) n
59        ishape = [4,2,2]
(gdb) p a
$2 = ( _data = (( i = 1 ), ( i = 1 ), ( i = 1 ), ( i = 1 ), ( i = 2 ), ( i = 2 ), ( i = 2 ), ( i = 2 ), ( i = 1 ), ( i = 1 ), ( i = 1 ), ( i = 1 ), ( i = 2 ), ( i = 2 ), ( i = 2 ), ( i = 2 )), _vptr = 0x26174 <__vtab_MAIN___S.22> )
(gdb)

Note that 'c' is very different than in the -O1 case. Is that expected?

Now with a binary built with -O2:

$ gdb -q class_transformational_2-O2.exe
Reading symbols from class_transformational_2-O2.exe...
(gdb) start
Temporary breakpoint 1 at 0x10704: file /home/thiago.bauermann/src/gcc/gcc/testsuite/gfortran.dg/class_transformational_2.f90, line 15.
Starting program: /home/thiago.bauermann/.cache/builds/gcc-native-aarch32/gcc/testsuite/gfortran/class_transformational_2-O2.exe
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".

Temporary breakpoint 1, MAIN__ () at /home/thiago.bauermann/src/gcc/gcc/testsuite/gfortran.dg/class_transformational_2.f90:15
15        class(t), allocatable :: scalar, a(:), aa(:), b(:,:), c(:,:,:), field(:,:,:)
(gdb) break 58
Breakpoint 2 at 0x10b34: file /home/thiago.bauermann/src/gcc/gcc/testsuite/gfortran.dg/class_transformational_2.f90, line 58.
(gdb) c
Continuing.

Breakpoint 2, check_spread () at /home/thiago.bauermann/src/gcc/gcc/testsuite/gfortran.dg/class_transformational_2.f90:58
58        a = reshape (c, [size (c)])
(gdb) p c
$1 = ( i = (0, 1072693248), x = 1, d = 1 )
(gdb) n
59        ishape = [4,2,2]
(gdb) p a
$2 = ( i = (0, 1045149306), x = 1.2904777690891933e-08, d = 1.2904777690891933e-08 )
(gdb)

Here, 'c' is the same as in the -O1 case.

If I run the "continue" GDB command, then the program completes successfully. I wasn't able to break on check_spread this time because that function isn't present in the optimized binary.

Another thing that I noticed is that the test occasionally fails on aarch64-linux, about 1 in 50 times when I run it repeatedly in a loop. This happens with the "-O1", "-O2", "-O3" and "-O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions" variations, but not with the "-O0" variation.

Because the failure is intermittent, I haven't yet been able to run a debugger when it happens. I'll try again next week with some scripting.
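
For anyone wanting to measure the intermittent failure rate themselves, a repeat-run loop can be sketched in Python as below. This is only an illustration, not the loop actually used for the numbers above: the function name count_failures and its arguments are hypothetical. It relies on the fact that a Fortran "STOP 12" ends the process with exit status 12, so any nonzero status counts as a failed run.

```python
import subprocess
import sys


def count_failures(cmd, runs):
    """Run cmd `runs` times and return {nonzero exit code: number of runs}."""
    failures = {}
    for _ in range(runs):
        # Discard the program's output; only the exit status matters here.
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if result.returncode != 0:
            failures[result.returncode] = failures.get(result.returncode, 0) + 1
    return failures
```

Calling count_failures(["./class_transformational_2.exe"], 200) would report how many of 200 runs exited with a nonzero status; both the path and the run count are placeholders.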

I tried reproducing on x86_64-linux, but couldn't.

I'm attaching the valgrind reports for arm and aarch64.

-- Thiago

Thiago Jung Bauermann

5:55 a.m.

Hello,

One more detail:

Thiago Jung Bauermann thiago.bauermann@linaro.org writes:

...

I can see the problem reliably when running the testcase binary for -O1 on an armv8l-linux-gnueabihf machine.

I ran your patch through a different CI loop that we have, where instead of using the distro's toolchain (binutils, gcc, glibc) to build and test the patch, it builds every component from scratch and from their respective tips of trunk.

This time it didn't detect any problem. All gfortran.dg/class_transformational_2.f90 tests passed:

https://ci.linaro.org/job/tcwg_gnu_native_check_gcc--master-arm-precommit/2/...

I think this means that with Ubuntu 22.04 glibc we see the problem, but when using the latest upstream glibc we don't.

-- Thiago

Paul Richard Thomas

7 Jul

6:05 a.m.

Hi Thiago,

Thank you very much for your debugging efforts. You really pulled out the stops.

Can I take it then that you will update the toolchain system wide so that I can commit the patch without triggering you every night? It would be a pity to XFAIL it after your efforts.

I thought that since the failure occurred at -O1 only, it must have been one of those sporadic, random failures that, as far as I can tell, are due to the system deciding that it has something more important to do than run the testsuite.

Best regards

Paul


Thiago Jung Bauermann

9 Jul

3:13 a.m.

Hello Paul,

Paul Richard Thomas paul.richard.thomas@gmail.com writes:

...

Thank you very much for your debugging efforts. You really pulled out the stops.

You're welcome. In the future if there are other issues or questions regarding our CI, please feel free to contact us.

...

Can I take it then that you will update the toolchain system wide so that I can commit the patch without triggering you every night? It would be a pity to XFAIL it after your efforts.

Now that there's a new Ubuntu LTS I believe we will update our systems to it in the near future, but I'm not sure exactly when.

In any case, committing your patch won't be a problem because we only report a regression once. The commit will trigger a new notification email because it will be the first time that the problem will be detected in trunk, but at that point our system will incorporate that FAIL into its known failures and not complain about it in the future.

...

On Sat, 6 Jul 2024 at 06:55, Thiago Jung Bauermann thiago.bauermann@linaro.org wrote:

I ran your patch through a different CI loop that we have, where instead of using the distro's toolchain (binutils, gcc, glibc) to build and test the patch, it builds every component from scratch and from their respective tips of trunk.

This time it didn't detect any problem. All gfortran.dg/class_transformational_2.f90 tests passed:

https://ci.linaro.org/job/tcwg_gnu_native_check_gcc--master-arm-precommit/2/...

I think this means that with Ubuntu 22.04 glibc we see the problem, but when using the latest upstream glibc we don't.

I ran the test on the same machine but inside a container with Ubuntu 24.04 and I couldn't reproduce the FAIL there, so this confirms my suspicion: the problem is in the system toolchain, likely in glibc.

-- Thiago

Paul Richard Thomas

5:46 a.m.

Many thanks for the comprehensive reply, Thiago.

As it happens, running valgrind with -s on both new testcases indicates problems emanating from one line in the other test, class_transformational_1.f90. I am investigating and will put it right.

Regards

Paul



linaro-toolchain@lists.linaro.org
