As discussed before, I tried using the rebootstrap tool [1] to see what problems come up once the entire distro gets rebuilt. Based on Lukasz' recommendation, I tried the 'y2038_edge' branch with his experimental glibc patches [2], using commit c2de7ee9461 dated 2020-02-17.
Here is a rough summary of what I tried, what worked, and what problems I ran into:
* Building a Debian package from this was fairly straightforward, using the 2.31 branch in the package git repository[3] after replacing the debian/patches/git-updates.diff file with one generated from [2] and disabling the hurd patches because of conflicts.
* After installing the modified x86 glibc package, I ran into a runtime bug in [4], which needs to pass AT_FDCWD instead of 0 to avoid ENOTDIR errors.
* Bootstrapping a regular time32 Debian armhf with this libc took me a few days to get right, but that was mostly for getting familiar with rebootstrap and running into known issues unrelated to time64 or the glibc changes.
* Actually building a time64 version of glibc turned out to be harder, including some issues discussed on the libc mailing list[5]:
- Always setting -D_TIME_BITS=64 in the global compiler flags for the distro breaks both the native 64-bit (x86_64) build and the 32-bit build, as glibc itself expects to be built without this.
- Removing the time32 symbols from the glibc shared object did not work as they are still used (a lot) internally, and by the testsuite.
- I tried converting all the internal symbols to use the time64 variants with the correct types (e.g. __clock_gettime64() instead of __clock_gettime()), but then ran into a lot of APIs that take timespec/timeval/... arguments and pass them down into internal functions. These seem to all be bugs that require adding a time64 version of the external ABI.
- After I abandoned that approach, I continued with a simple patch to features.h that sets _TIME_BITS/_FILE_OFFSET_BITS based on '#if !defined _LIBC && __TIMESIZE == 32', which ignores the bugs I found earlier but got me a lot further.
- Building the i386 glibc with that patch, I ran into over 150 testsuite failures [6]. This looked like there was a fundamental mistake on my side, but after I looked into a few of the failures, most seemed to be either glibc or testsuite bugs that have to be addressed individually. I considered giving up at this point, but as Lukasz has said that he had successfully built a working system using Yocto, I kept going anyway and marked these all as expected failures in the debian package.
* There are a couple of noteworthy issues in glibc-y2038 I'd like to point out in particular, though I'm sure these are not the only important ones:
- The clock_nanosleep() prototype needed a '__THROW' annotation to complete the build.
- The nptl and sunrpc portions have numerous interfaces with 'timeval' or 'timespec' arguments that each cause an ABI break.
- stat()/fstat()/lstat(), nanosleep(), wait3()/wait4(), ppoll_chk() are some of the other interfaces that take a time_t based argument and need to grow a time64 version to avoid an ABI mismatch.
- The timeval prototype appears to be broken, as it's missing padding on architectures without native alignment of __time64 (e.g. i386) and on all big-endian architectures.
- some testcases hang in futex_wait() or clock_nanosleep() because of incorrect timeout arguments, presumably from type mismatches.
* There is an open question regarding the name of the Debian architecture. For my experiments, I kept using the 'armhf' name unmodified, though there seems to be a general feeling that using a different name would be required to address the broad incompatibilities between time32 and time64 versions of all the libraries in the distro. Gradually changing them won't work because of the timeline and the number of affected libraries. However, the new name of the distro also implies having a distinct target triplet, which must then be known by glibc along with everything else using config.guess/config.sub. I expect this topic to require a lot more discussion.
* Continuing with the rebootstrap build despite the known glibc issues and the open question on the architecture name went surprisingly well, only two out of the 152 source packages I built had compile-time problems:
- building the final gcc failed in libsanitizer, which has compile-time checks to ensure some libc data structures have the expected layout. It noticed that 'struct timeb' and 'struct dirent' are different based on _TIME_BITS and _FILE_OFFSET_BITS. I disabled the checks to be able to continue. To do this properly, the library has to learn about the new data structures as well. I opened a bug report against the library[7].
- libpreludecpp12 failed to build because of checks for changes in the exported functions, which are different with time64. I disabled the checks. Once we have agreed on a new debian architecture name, the symbols can be made arch specific.
* After everything was built, I tried installing the packages into a chroot with qemu-debootstrap, which failed because I had configured the glibc to assume it's running on a new kernel while the qemu-user binary I had lacks the new syscalls. I believe this is fixed in upstream qemu, but did not try that.
* Trying to install again I used a clean debian-arm64 installation running in qemu-system-aarch64, and attempted installing the armhf packages using a regular debootstrap, running the 32-bit binaries in compat mode of a recent arm64 kernel. This partially worked and I could chroot into the system and use a shell, but ultimately the debootstrap did not complete because of errors. I saw that 'tar' had failed because of the stat() glibc ABI mismatch breaking its private gnulib fdutimens() implementation, and this is where I gave up.
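The layout problems in the list above (the timeval padding, the libsanitizer layout checks, the stat() mismatch that broke tar) all come down to two pieces of code disagreeing about the size or field offsets of a structure. Below is a minimal sketch of the padding case, assuming GCC or Clang. All demo_ names are hypothetical, and the 4-byte-aligned typedef only imitates the i386 alignment rule so the effect shows up on any host; this is not the glibc definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a 64-bit time type on i386, where 64-bit
   integers are only 4-byte aligned inside structures. */
typedef int64_t demo_time64_t __attribute__((aligned(4)));

/* 64-bit seconds followed by a 32-bit microseconds field, no padding. */
struct demo_timeval_unpadded {
    demo_time64_t tv_sec;
    int32_t tv_usec;
};

/* Same fields with explicit tail padding, so the size no longer depends
   on how strictly the 64-bit member is aligned. */
struct demo_timeval_padded {
    demo_time64_t tv_sec;
    int32_t tv_usec;
    int32_t pad;
};

/* Compile-time layout checks, similar in spirit to the libsanitizer ones. */
static_assert(sizeof(struct demo_timeval_unpadded) == 12, "unpadded size");
static_assert(sizeof(struct demo_timeval_padded) == 16, "padded size");
static_assert(offsetof(struct demo_timeval_padded, tv_usec) == 8, "tv_usec offset");

size_t demo_unpadded_size(void) { return sizeof(struct demo_timeval_unpadded); }
size_t demo_padded_size(void)   { return sizeof(struct demo_timeval_padded); }
```

A caller built against one of these layouts and a callee built against the other would read or write four bytes past what the other side expects, which is the kind of silent mismatch the compile-time checks are meant to catch.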
I have spent more time on this now than I had planned, and don't expect to do further work on it anytime soon, but I hope my summary is useful to others that are going to need this later. I can obviously share my patches and build artifacts if anyone needs them. There are two additional approaches that would likely get a Debian bootstrap further, but that I have not tried as they were previously dismissed:
* Adding a time64 armhf as a separate (incompatible) target in glibc that defines __TIMESIZE==64 and a 64-bit __time_t would avoid most of the remaining ABI issues and put armhf-time64 in the same category as riscv32 and arc, but this idea was so far rejected by the glibc maintainers. Depending on how hard this turns out to be, it could be used to get to the point of self-hosting though, and help find time64 related bugs in the rest of the distro.
* Doing the bootstrap using a musleabihf target instead of gnueabihf would avoid the current issues internal to glibc-y2038, but instead lead to new problems with packages that do not currently work with musl. Adelie Linux has shown that it's already possible to build a useful distro using musl and time64[8], and this would sidestep the question of the target triplet. While it would also help find and fix additional bugs in packages, and make an interesting unofficial Debian target, I don't see it replacing the existing armhf port any time soon.
For additional information about the Debian plans, see the article on LWN[9] that summarizes the discussion started by Steve McIntyre [10].
Arnd
[1] https://wiki.debian.org/HelmutGrohne/rebootstrap
[2] https://github.com/lmajewski/y2038_glibc/tree/y2038_edge
[3] https://salsa.debian.org/glibc-team/glibc/-/tree/glibc-2.31
[4] https://github.com/lmajewski/y2038_glibc/commit/2f72ea2b6f6ee
[5] https://sourceware.org/pipermail/libc-alpha/2020-February/111375.html
[6] https://pastebin.com/fJYV2stF
[7] https://bugs.llvm.org/show_bug.cgi?id=45138
[8] https://wiki.adelielinux.org/wiki/Project:Time64
[9] https://lwn.net/Articles/812767/
[10] https://lwn.net/ml/debian-devel/20200204131410.GF3043@tack.einval.com/
On Wed, 11 Mar 2020 13:52:00 +0100, Arnd Bergmann wrote:
As discussed before, I tried using the rebootstrap tool [1] to see what problems come up once the entire distro gets rebuilt. Based on Lukasz' recommendation, I tried the 'y2038_edge' branch with his experimental glibc patches [2], using commit c2de7ee9461 dated 2020-02-17.
Here is a rough summary of what I tried, what worked, and what problems I ran into:
[...]
Actually building a time64 version of glibc turned out to be harder, including some issues discussed on the libc mailing list[5]:
- Always setting -D_TIME_BITS=64 in the global compiler flags for the distro breaks both the native 64-bit (x86_64) build and the 32-bit build, as glibc itself expects to be built without this.
This seems like a small issue, but glibc should probably either remove it from CFLAGS in the build system or at least catch it at configure time and error out, so that it's not confusing when it breaks.
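As a rough illustration of the two pieces being discussed here, the sketch below combines a guard that refuses to build the library itself with time64 forced on, and the kind of applications-only default Arnd describes for features.h. It is a compile-time guard, which is a different mechanism from a configure-time check, and every DEMO_ macro is a hypothetical stand-in (DEMO_LIBC for _LIBC, DEMO_TIMESIZE for __TIMESIZE, DEMO_TIME_BITS for _TIME_BITS), not the actual glibc patch.

```c
#include <assert.h>

/* Hypothetical stand-in for __TIMESIZE; 32 models a 32-bit target. */
#ifndef DEMO_TIMESIZE
# define DEMO_TIMESIZE 32
#endif

static_assert(DEMO_TIMESIZE == 32 || DEMO_TIMESIZE == 64,
              "unexpected DEMO_TIMESIZE");

/* Fail early and clearly if the library itself is built with time64
   forced on through the global compiler flags. */
#if defined DEMO_LIBC && defined DEMO_TIME_BITS
# error "DEMO_TIME_BITS must not be set while building the C library itself"
#endif

/* Outside the library, default 32-bit targets to the 64-bit interfaces. */
#if !defined DEMO_LIBC && DEMO_TIMESIZE == 32
# ifndef DEMO_TIME_BITS
#  define DEMO_TIME_BITS 64
# endif
# ifndef DEMO_FILE_OFFSET_BITS
#  define DEMO_FILE_OFFSET_BITS 64
# endif
#endif

/* Report which time width this translation unit ended up with. */
int demo_time_bits(void)
{
#ifdef DEMO_TIME_BITS
    return DEMO_TIME_BITS;
#else
    return DEMO_TIMESIZE;
#endif
}

/* Report the file offset width, or 0 if the default was not applied. */
int demo_file_offset_bits(void)
{
#ifdef DEMO_FILE_OFFSET_BITS
    return DEMO_FILE_OFFSET_BITS;
#else
    return 0;
#endif
}
```

Compiling the same file with -DDEMO_LIBC -DDEMO_TIME_BITS=64 stops at the #error line instead of producing a confusing failure later in the build.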
- Removing the time32 symbols from the glibc shared object did not work as they are still used (a lot) internally, and by the testsuite.
That they're used internally sounds like a major problem; anywhere they're being used internally potentially has hidden Y2038 bugs. This is also why I'm concerned about glibc's approach of not building itself with _TIME_BITS=64, and just undefining it or doing something else in the wrapper files for the legacy time32 symbols.
- I tried converting all the internal symbols to use the time64 variants with the correct types (e.g. __clock_gettime64() instead of __clock_gettime()), but then ran into a lot of APIs that take timespec/timeval/... arguments and pass them down into internal functions. These seem to all be bugs that require adding a time64 version of the external ABI.
This also sounds bad. The set of functions that need time64 versions has little to do with the syscalls that needed changing, and rather is a matter of which functions have time_t-derived types in their public interfaces. I think it would be useful to compare current glibc patches against musl's "nm -D libc.so | grep time64", which has 63 lines. There may be more functions glibc needs to have time64 versions of because of additional functionality it supports, but if it's lacking any of the ones musl has, that's probably indicative of a bug. I'm attaching the list.
- The nptl and sunrpc portions have numerous interfaces with 'timeval' or 'timespec' arguments that each cause an ABI break.
nptl is essential but I think sunrpc is pure legacy ABI and not intended to be linkable in the future.
- stat()/fstat()/lstat(), nanosleep(), wait3()/wait4(), ppoll_chk() are some of the other interfaces that take a time_t based argument and need to grow a time64 version to avoid an ABI mismatch.
And this requires a decision whether to keep the __xstat framework with a new _STAT_VER or make a new symbol.
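To make the two options concrete, here is a small sketch under stated assumptions: the demo_ names, version numbers and one-field structures are all hypothetical, and the timestamp is passed in by the caller instead of coming from the kernel. It only shows the shape of "one entry point plus a version argument" versus "a separate new symbol", not how glibc implements either.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical version numbers telling the callee which layout buf has. */
#define DEMO_STAT_VER_TIME32 1
#define DEMO_STAT_VER_TIME64 2

struct demo_stat32 { int32_t st_mtime_sec; };
struct demo_stat64 { int64_t st_mtime_sec; };

/* Option 1: keep a single versioned entry point and add a new version. */
int demo_xstat(int ver, int64_t mtime, void *buf)
{
    assert(buf != NULL);
    switch (ver) {
    case DEMO_STAT_VER_TIME32: {
        struct demo_stat32 *st = buf;
        if (mtime > INT32_MAX || mtime < INT32_MIN) {
            errno = EOVERFLOW;   /* value does not fit the old layout */
            return -1;
        }
        st->st_mtime_sec = (int32_t)mtime;
        return 0;
    }
    case DEMO_STAT_VER_TIME64: {
        struct demo_stat64 *st = buf;
        st->st_mtime_sec = mtime;
        return 0;
    }
    default:
        errno = EINVAL;          /* unknown layout version */
        return -1;
    }
}

/* Option 2: a separate symbol that only ever sees the new layout. */
int demo_stat_time64(int64_t mtime, struct demo_stat64 *st)
{
    assert(st != NULL);
    st->st_mtime_sec = mtime;
    return 0;
}
```

Either way, old binaries keep calling the time32 path with the old layout, while newly built code has to be steered to the time64 path by the headers.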
I have spent more time on this now than I had planned, and don't expect to do further work on it anytime soon, but I hope my summary is useful to others that are going to need this later. I can obviously share my patches and build artifacts if anyone needs them. There are two additional approaches that would likely get a Debian bootstrap further, but that I have not tried as they were previously dismissed:
It's really amazing how much time you put into this. Thank you!!
Rich
On Fri, Mar 13, 2020 at 9:22 PM Rich Felker <dalias@libc.org> wrote:
On Wed, 11 Mar 2020 13:52:00 +0100, Arnd Bergmann wrote:
As discussed before, I tried using the rebootstrap tool [1] to see what problems come up once the entire distro gets rebuilt. Based on Lukasz' recommendation, I tried the 'y2038_edge' branch with his experimental glibc patches [2], using commit c2de7ee9461 dated 2020-02-17.
Here is a rough summary of what I tried, what worked, and what problems I ran into:
[...]
Actually building a time64 version of glibc turned out to be harder, including some issues discussed on the libc mailing list[5]:
- Always setting -D_TIME_BITS=64 in the global compiler flags for the distro breaks both the native 64-bit (x86_64) build and the 32-bit build, as glibc itself expects to be built without this.
This seems like a small issue, but glibc should probably either remove it from CFLAGS in the build system or at least catch it at configure time and error out, so that it's not confusing when it breaks.
Right, that would make sense. For the test suite though, I guess it would actually need to run each test case that references time_t both ways.
- Removing the time32 symbols from the glibc shared object did not work as they are still used (a lot) internally, and by the testsuite.
That they're used internally sounds like a major problem; anywhere they're being used internally potentially has hidden Y2038 bugs. This is also why I'm concerned about glibc's approach of not building itself with _TIME_BITS=64, and just undefining it or doing something else in the wrapper files for the legacy time32 symbols.
I thought this was the long-term plan. Working on the ABI first and then changing the implementation may help speed up the timeline before distro-level work can start, but OTOH removing all the 32-bit codepaths from the implementation first makes it more likely to find all relevant bits.
- The nptl and sunrpc portions have numerous interfaces with 'timeval' or 'timespec' arguments that each cause an ABI break.
nptl is essential but I think sunrpc is pure legacy ABI and not intended to be linkable in the future.
That would be helpful, but what does it mean for distro packages that link against it today? codesearch.debian.org e.g. finds nfs-utils, nis, libtirpc, ntirpc and nfswatch including <rpc/*.h>. Can these just use a replacement that is built with 64-bit time_t then?
Arnd
On Mon, Mar 16, 2020 at 03:28:43PM +0100, Arnd Bergmann wrote:
On Fri, Mar 13, 2020 at 9:22 PM Rich Felker <dalias@libc.org> wrote:
On Wed, 11 Mar 2020 13:52:00 +0100, Arnd Bergmann wrote:
As discussed before, I tried using the rebootstrap tool [1] to see what problems come up once the entire distro gets rebuilt. Based on Lukasz' recommendation, I tried the 'y2038_edge' branch with his experimental glibc patches [2], using commit c2de7ee9461 dated 2020-02-17.
Here is a rough summary of what I tried, what worked, and what problems I ran into:
[...]
Actually building a time64 version of glibc turned out to be harder, including some issues discussed on the libc mailing list[5]:
- Always setting -D_TIME_BITS=64 in the global compiler flags for the distro breaks both the native 64-bit (x86_64) build and the 32-bit build, as glibc itself expects to be built without this.
This seems like a small issue, but glibc should probably either remove it from CFLAGS in the build system or at least catch it at configure time and error out, so that it's not confusing when it breaks.
Right, that would make sense. For the test suite though, I guess it would actually need to run each test case that references time_t both ways.
Indeed.
- Removing the time32 symbols from the glibc shared object did not work as they are still used (a lot) internally, and by the testsuite.
That they're used internally sounds like a major problem; anywhere they're being used internally potentially has hidden Y2038 bugs. This is also why I'm concerned about glibc's approach of not building itself with _TIME_BITS=64, and just undefining it or doing something else in the wrapper files for the legacy time32 symbols.
I thought this was the long-term plan. Working on the ABI first and then changing the implementation may help speed up the timeline before distro-level work can start, but OTOH removing all the 32-bit codepaths from the implementation first makes it more likely to find all relevant bits.
In my experience it was easiest to do *with* the aid of the public header redirections applying internally to libc as well. I don't really understand how glibc is trying to make this easier by avoiding that.
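For readers who have not seen header redirection before, here is a minimal sketch of the idea, assuming the GCC/Clang asm-label extension. The demo_ names are hypothetical and the returned value is a fixed number, not a real clock reading. The point is that once the header carries the redirection, both the definition and any library-internal caller that includes it end up on the 64-bit symbol without further changes.

```c
#include <assert.h>
#include <stdint.h>

/* What the public header would carry: source code keeps saying
   demo_now(), the object code uses the symbol demo_now_time64. */
int64_t demo_now(void) __asm__("demo_now_time64");

/* The library's own definition sees the same declaration, so it is
   emitted under the redirected symbol name. */
int64_t demo_now(void)
{
    return INT64_C(4102444800);   /* fixed value past 2038, for illustration */
}

/* A library-internal caller compiled with the same header picks up the
   64-bit variant automatically. */
int64_t demo_internal_deadline(int64_t delay)
{
    assert(delay >= 0);
    return demo_now() + delay;
}
```

On an ELF target, running nm on the resulting object should list demo_now_time64 and no plain demo_now, so a legacy time32 symbol would have to be provided separately under the old name.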
- The nptl and sunrpc portions have numerous interfaces with 'timeval' or 'timespec' arguments that each cause an ABI break.
nptl is essential but I think sunrpc is pure legacy ABI and not intended to be linkable in the future.
That would be helpful, but what does it mean for distro packages that link against it today? codesearch.debian.org e.g. finds nfs-utils, nis, libtirpc, ntirpc and nfswatch including <rpc/*.h>. Can these just use a replacement that is built with 64-bit time_t then?
libtirpc is the replacement. I wasn't aware it uses libc-provided rpc headers (presumably only if they exist, since folks are using it fine on musl) but even if so I think the types will automatically update when time_t changes. Of course that leaves the libtirpc ABI dependent on which time_t is used.
Rich
On Mon, Mar 16, 2020 at 3:47 PM Rich Felker <dalias@libc.org> wrote:
libtirpc is the replacement. I wasn't aware it uses libc-provided rpc headers (presumably only if they exist, since folks are using it fine on musl) but even if so I think the types will automatically update when time_t changes. Of course that leaves the libtirpc ABI dependent on which time_t is used.
Ok, makes sense. I suppose it just provides a header with the same name then.
Arnd
On Mon, 2020-03-16 at 16:02 +0100, Arnd Bergmann wrote:
On Mon, Mar 16, 2020 at 3:47 PM Rich Felker <dalias@libc.org> wrote:
libtirpc is the replacement. I wasn't aware it uses libc-provided rpc headers (presumably only if they exist, since folks are using it fine on musl) but even if so I think the types will automatically update when time_t changes. Of course that leaves the libtirpc ABI dependent on which time_t is used.
Ok, makes sense. I suppose it just provides a header with the same name then.
* nfs-utils build-depends on libtirpc-dev, and isn't using the glibc SunRPC headers except for <rpc/netdb.h>. libtirpc's <rpc/rpcent.h> specifically avoids declaring things that are also declared in glibc's <rpc/netdb.h>.
* ntirpc is a different port of the SunRPC code, used by nfs-ganesha.
* nis and nfswatch really are using the glibc SunRPC headers.
Ben.
* Ben Hutchings:
On Mon, 2020-03-16 at 16:02 +0100, Arnd Bergmann wrote:
On Mon, Mar 16, 2020 at 3:47 PM Rich Felker <dalias@libc.org> wrote:
libtirpc is the replacement. I wasn't aware it uses libc-provided rpc headers (presumably only if they exist, since folks are using it fine on musl) but even if so I think the types will automatically update when time_t changes. Of course that leaves the libtirpc ABI dependent on which time_t is used.
Ok, makes sense. I suppose it just provides a header with the same name then.
* nfs-utils build-depends on libtirpc-dev, and isn't using the glibc SunRPC headers except for <rpc/netdb.h>. libtirpc's <rpc/rpcent.h> specifically avoids declaring things that are also declared in glibc's <rpc/netdb.h>.
* ntirpc is a different port of the SunRPC code, used by nfs-ganesha.
* nis and nfswatch really are using the glibc SunRPC headers.
Which part of NIS? There's a new upstream for libnsl https://github.com/thkukuk/libnsl and the NSS module https://github.com/thkukuk/libnss_nis. (There is a nisplus module as well.)
All these use libtirpc and support IPv6 in addition to IPv4. As far as I know, it is possible to build a full NIS stack without relying on any of the legacy glibc code.
(I don't know about nfswatch.)
On Fri, 2020-03-20 at 00:09 +0100, Florian Weimer wrote:
* Ben Hutchings:
On Mon, 2020-03-16 at 16:02 +0100, Arnd Bergmann wrote:
On Mon, Mar 16, 2020 at 3:47 PM Rich Felker <dalias@libc.org> wrote:
libtirpc is the replacement. I wasn't aware it uses libc-provided rpc headers (presumably only if they exist, since folks are using it fine on musl) but even if so I think the types will automatically update when time_t changes. Of course that leaves the libtirpc ABI dependent on which time_t is used.
Ok, makes sense. I suppose it just provides a header with the same name then.
* nfs-utils build-depends on libtirpc-dev, and isn't using the glibc SunRPC headers except for <rpc/netdb.h>. libtirpc's <rpc/rpcent.h> specifically avoids declaring things that are also declared in glibc's <rpc/netdb.h>.
* ntirpc is a different port of the SunRPC code, used by nfs-ganesha.
* nis and nfswatch really are using the glibc SunRPC headers.
Which part of NIS? There's a new upstream for libnsl https://github.com/thkukuk/libnsl and the NSS module https://github.com/thkukuk/libnss_nis. (There is a nisplus module as well.)
This is Debian's "nis" source package, which is a bundle of yp-tools, ypserv, and ypbind-mt from the same upstream author. It's unmaintained and has lots of bug reports in Debian.
All these use libtirpc and support IPv6 in addition to IPv4. As far as I know, it is possible to build a full NIS stack without relying on any of the legacy glibc code.
(I don't know about nfswatch.)
The upstream for that is https://sourceforge.net/projects/nfswatch/. The current Fedora package is patched to use libtirpc.
Ben.
Hey Arnd,
Catching up on this thread a little late, sorry... :-/
On Wed, Mar 11, 2020 at 01:52:00PM +0100, Arnd Bergmann wrote:
As discussed before, I tried using the rebootstrap tool [1] to see what problems come up once the entire distro gets rebuilt. Based on Lukasz' recommendation, I tried the 'y2038_edge' branch with his experimental glibc patches [2], using commit c2de7ee9461 dated 2020-02-17.
Here is a rough summary of what I tried, what worked, and what problems I ran into:
* Building a Debian package from this was fairly straightforward, using the 2.31 branch in the package git repository[3] after replacing the debian/patches/git-updates.diff file with one generated from [2] and disabling the hurd patches because of conflicts.
* After installing the modified x86 glibc package, I ran into a runtime bug in [4], which needs to pass AT_FDCWD instead of 0 to avoid ENOTDIR errors.
* Bootstrapping a regular time32 Debian armhf with this libc took me a few days to get right, but that was mostly for getting familiar with rebootstrap and running into known issues unrelated to time64 or the glibc changes.
Cool!
<snip glibc questions>
* There is an open question regarding the name of the Debian architecture. For my experiments, I kept using the 'armhf' name unmodified, though there seems to be a general feeling that using a different name would be required to address the broad incompatibilities between time32 and time64 versions of all the libraries in the distro. Gradually changing them won't work because of the timeline and the number of affected libraries. However, the new name of the distro also implies having a distinct target triplet, which must then be known by glibc along with everything else using config.guess/config.sub. I expect this topic to require a lot more discussion.
ACK. I'm about to prod on this again.
* Continuing with the rebootstrap build despite the known glibc issues and the open question on the architecture name went surprisingly well, only two out of the 152 source packages I built had compile-time problems:
- building the final gcc failed in libsanitizer, which has compile-time checks to ensure some libc data structures have the expected layout. It noticed that 'struct timeb' and 'struct dirent' are different based on _TIME_BITS and _FILE_OFFSET_BITS. I disabled the checks to be able to continue. To do this properly, the library has to learn about the new data structures as well. I opened a bug report against the library[7].
- libpreludecpp12 failed to build because of checks for changes in the exported functions, which are different with time64. I disabled the checks. Once we have agreed on a new debian architecture name, the symbols can be made arch specific.
Yup.
* After everything was built, I tried installing the packages into a chroot with qemu-debootstrap, which failed because I had configured the glibc to assume it's running on a new kernel while the qemu-user binary I had lacks the new syscalls. I believe this is fixed in upstream qemu, but did not try that.
* Trying to install again I used a clean debian-arm64 installation running in qemu-system-aarch64, and attempted installing the armhf packages using a regular debootstrap, running the 32-bit binaries in compat mode of a recent arm64 kernel. This partially worked and I could chroot into the system and use a shell, but ultimately the debootstrap did not complete because of errors. I saw that 'tar' had failed because of the stat() glibc ABI mismatch breaking its private gnulib fdutimens() implementation, and this is where I gave up.
Nod. :-/ I think it's time that somebody else picked up from you here.
I have spent more time on this now than I had planned, and don't expect to do further work on it anytime soon, but I hope my summary is useful to others that are going to need this later. I can obviously share my patches and build artifacts if anyone needs them. There are two additional approaches that would likely get a Debian bootstrap further, but that I have not tried as they were previously dismissed:
* Adding a time64 armhf as a separate (incompatible) target in glibc that defines __TIMESIZE==64 and a 64-bit __time_t would avoid most of the remaining ABI issues and put armhf-time64 in the same category as riscv32 and arc, but this idea was so far rejected by the glibc maintainers. Depending on how hard this turns out to be, it could be used to get to the point of self-hosting though, and help find time64 related bugs in the rest of the distro.
OK. I'm thinking it's probably not worth it?
* Doing the bootstrap using a musleabihf target instead of gnueabihf would avoid the current issues internal to glibc-y2038, but instead lead to new problems with packages that do not currently work with musl. Adelie Linux has shown that it's already possible to build a useful distro using musl and time64[8], and this would sidestep the question of the target triplet. While it would also help find and fix additional bugs in packages, and make an interesting unofficial Debian target, I don't see it replacing the existing armhf port any time soon.
Ditto.
Thanks for the great summary of what you've been working on!
On Mon, Mar 23, 2020 at 7:21 PM Steve McIntyre <steve@einval.com> wrote:
On Wed, Mar 11, 2020 at 01:52:00PM +0100, Arnd Bergmann wrote:
* Adding a time64 armhf as a separate (incompatible) target in glibc that defines __TIMESIZE==64 and a 64-bit __time_t would avoid most of the remaining ABI issues and put armhf-time64 in the same category as riscv32 and arc, but this idea was so far rejected by the glibc maintainers. Depending on how hard this turns out to be, it could be used to get to the point of self-hosting though, and help find time64 related bugs in the rest of the distro.
OK. I'm thinking it's probably not worth it?
This depends on the timeline of Lukasz' work. My feeling is that there is still quite a bit to be done before it's worth trying the Debian bootstrap again.
If you or someone else wants to continue where I stopped with the Debian rebuilding without waiting for the complete glibc port, adding a new armhf target to glibc on top of the current glibc-y2038 tree is probably a quicker way to get something that builds and boots. I don't know how much work exactly there would be for this approach, but my feeling is that it's not that much after looking at the kind of problems I ran into, and at the state of the riscv32 port that uses the same approach.
Arnd