[RFC PATCH 00/34] The rest of the x86_64-gnu port
From: Sergey Bugaev
Subject: [RFC PATCH 00/34] The rest of the x86_64-gnu port
Date: Sun, 19 Mar 2023 18:09:43 +0300
Hello!
(Naturally, the subject line is a reference to the "How to draw an owl" meme.)
It's been more than a month since I've tried to run ./configure
--host=x86_64-gnu and see what would come out of it, and here we are now:
with these patches, glibc fully builds, and even somewhat "works"!
On testing
==========
By "works", I mean:
I was unable to actually get it running on GNU Mach. It either never gets
started, or crashes soon enough. The latter is actually to be expected, since
the kernel does not actually support i386_fsgs_base_state yet. I was unable
to investigate what exactly happens, because in addition to the troubles with
actually running GNU Mach on qemu-system-x86_64 (-kernel doesn't work..., you
really have to build an image with GRUB) and attaching a debugger to it
(either GDB or QEMU get utterly confused by the switch to the long mode...),
I had troubles with actually spawning the task while breaking on its first
instruction (a la starti). In particular prompt-task-resume didn't seem to
work for me, nor did breaking somewhere before the task should have been
resumed.
So I would appreciate some help with both testing this patchset (i.e. if you
do have a working x86_64 Mach + userspace setup, build glibc and try to run
it), and some general tips about how I would go about debugging the bootstrap
task from the first instruction onwards with x86_64 GNU Mach, QEMU, and GDB.
Anyone? Luca (cc'ed), perhaps you could help me with testing & give me some
tips?
Instead of testing on GNU Mach, I settled for the next best thing and tested
it on GNU/Linux, under GDB. I had to skip over the syscalls and emulate their
effects, either in my head (e.g. the return value of mach_reply_port ()), by
writing $fs_base in GDB for thread_set_state (i386_fsgs_base_state), or by
making a Linux syscall to mmap some anonymous memory (for vm_map). Obviously
this is not the same thing as running it on the Mach for real, but -- it went
fine, and reached main ()! This means there likely aren't any catastrophic
issues with early startup (think init-first.c), TLS setup / accesses, etc.
On assembly and registers
=========================
This patchset involves code that has to deal with inline assembly and/or registers,
such as intr-msg.h, longjmp, and sigreturn. I have written *something* that
looks like it might work, but without actual testing, it's hard to know if it
does. We can't really test any signal-related code until there's enough of the
Hurd running to have a proc server, etc.
As for sigreturn specifically: I'm concerned about the possibility that
putting the register dump onto the user's stack (or at %rsp - 128, on x86_64)
may clobber the data trampoline.c puts there (unless an altstack is used),
including the very sigcontext. This applies to both i386 and x86_64.
Empirically, we know this works out fine for i386 -- maybe sigcontext doesn't
actually get overwritten, or gets overwritten in just the right way (think
memmove, although i386/sigreturn.c actually uses memcpy...).
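To make the memmove remark concrete: memcpy has undefined behavior when the source and destination overlap, while memmove is specified to handle that case. The following is only a minimal sketch, not the sigreturn code; struct fake_context, relocate_context () and the offsets are hypothetical names and values picked for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a saved register dump; this is NOT the
   real struct sigcontext.  */
struct fake_context
{
  unsigned long regs[8];
};

/* Copy a context from SRC_OFF to DST_OFF inside BUF.  The two ranges
   may overlap, which is why memmove is used rather than memcpy.  */
static void
relocate_context (unsigned char *buf, size_t buf_size,
                  size_t src_off, size_t dst_off)
{
  assert (src_off + sizeof (struct fake_context) <= buf_size);
  assert (dst_off + sizeof (struct fake_context) <= buf_size);
  memmove (buf + dst_off, buf + src_off, sizeof (struct fake_context));
}

/* Return 1 if a context survives an overlapping relocation intact.  */
static int
overlapping_copy_preserves_context (void)
{
  unsigned char stack[256];
  struct fake_context ctx, out;

  for (size_t i = 0; i < 8; i++)
    ctx.regs[i] = 0x1000 + i;

  /* Place the context at offset 64; these ranges do not overlap.  */
  memcpy (stack + 64, &ctx, sizeof ctx);

  /* Move it 16 bytes lower; source and destination now overlap.  */
  relocate_context (stack, sizeof stack, 64, 48);

  memcpy (&out, stack + 48, sizeof out);
  return memcmp (&ctx, &out, sizeof ctx) == 0;
}
```

Whether the real register dump and the sigcontext overlap at all depends on the stack layout trampoline.c produces, which this sketch does not model.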
I also haven't given much thought to FP state manipulation, since I know very
little about it. It might be that it's broken entirely.
In any case, it wouldn't hurt if you review my attempts at asm & register
manipulation extra carefully.
On TLS and microoptimization
============================
As you can see, I've done a bunch of changes to how TLS-related things work,
on both x86_64 and i386. The reasons for this are:
1. I wanted to minimize <hurd/threadvar.h> and its usages, so every line
dropped from <hurd/threadvar.h> is a small win. In the end, only
__hurd_sigthread_stack_{base,end} remain. I think these could be moved
to <hurd/signal.h>, and we should be able to rid ourselves of
<hurd/threadvar.h> for good.
2. I have discovered that the way __hurd_local_reply_port is declared is prone
to GCC miscompiling accesses to it (reported here: [0]). Even when not
miscompiled, this resulted in pretty inefficient code generation. Since two
of the three places where __hurd_local_reply_port was used were in signal
code where we know for sure that TLS is already working (since we must be
running the signal thread), they could access tcb->reply_port directly
(using the appropriate THREAD_*MEM accessor macros), and the rest of
__hurd_local_reply_port / __LIBC_NO_TLS logic could be moved into
mig-reply.c (and improved/specialized there), so that's what I've done.
[0]: https://sourceware.org/pipermail/libc-alpha/2023-March/146304.html
3. Disabling / compiling out support for the no-TLS case in libc.so (and
libpthread.so, etc.) -- but not in static builds, and not in ld.so. This
turned out to be kind of required for x86_64 (more on that below), but it
is a nice optimization in and of itself.
To illustrate the overall effect of these optimizations, here's a comparison
of the code generated for mig_get_reply_port () in libc.so for i386:
Before the changes (as shipped in Debian GNU/Hurd):
Dump of assembler code for function __GI___mig_get_reply_port:
0x0001c0a0 <+0>: push %ebp
0x0001c0a1 <+1>: mov %ds,%dx
0x0001c0a4 <+4>: mov %gs,%ax
0x0001c0a7 <+7>: mov %esp,%ebp
0x0001c0a9 <+9>: push %esi
0x0001c0aa <+10>: call 0x1e1bc5 <__x86.get_pc_thunk.si>
0x0001c0af <+15>: add $0x24af45,%esi
0x0001c0b5 <+21>: push %ebx
0x0001c0b6 <+22>: cmp %ax,%dx
0x0001c0b9 <+25>: je 0x1c130 <__GI___mig_get_reply_port+144>
0x0001c0bb <+27>: mov %gs:0x0,%eax
0x0001c0c1 <+33>: mov 0x38(%eax),%eax
0x0001c0c4 <+36>: test %eax,%eax
0x0001c0c6 <+38>: je 0x1c110 <__GI___mig_get_reply_port+112>
0x0001c0c8 <+40>: mov %ds,%dx
0x0001c0cb <+43>: mov %gs,%ax
0x0001c0ce <+46>: cmp %ax,%dx
0x0001c0d1 <+49>: je 0x1c0f8 <__GI___mig_get_reply_port+88>
0x0001c0d3 <+51>: lea 0x1798(%esi),%edx
0x0001c0d9 <+57>: mov %gs:0x0,%eax
0x0001c0df <+63>: lea 0x38(%eax),%ecx
0x0001c0e2 <+66>: cmp %edx,%ecx
0x0001c0e4 <+68>: je 0x1c0f8 <__GI___mig_get_reply_port+88>
0x0001c0e6 <+70>: mov %ds,%bx
0x0001c0e9 <+73>: mov %gs,%cx
0x0001c0ec <+76>: cmp %cx,%bx
0x0001c0ef <+79>: je 0x1c110 <__GI___mig_get_reply_port+112>
0x0001c0f1 <+81>: mov 0x38(%eax),%eax
0x0001c0f4 <+84>: cmp %eax,(%edx)
0x0001c0f6 <+86>: je 0x1c110 <__GI___mig_get_reply_port+112>
0x0001c0f8 <+88>: mov %ds,%dx
0x0001c0fb <+91>: mov %gs,%ax
0x0001c0fe <+94>: cmp %ax,%dx
0x0001c101 <+97>: je 0x1c140 <__GI___mig_get_reply_port+160>
0x0001c103 <+99>: pop %ebx
0x0001c104 <+100>: pop %esi
0x0001c105 <+101>: mov %gs:0x0,%eax
0x0001c10b <+107>: pop %ebp
0x0001c10c <+108>: mov 0x38(%eax),%eax
0x0001c10f <+111>: ret
0x0001c110 <+112>: mov %ds,%dx
0x0001c113 <+115>: mov %gs,%ax
0x0001c116 <+118>: cmp %ax,%dx
0x0001c119 <+121>: je 0x1c150 <__GI___mig_get_reply_port+176>
0x0001c11b <+123>: mov %gs:0x0,%ebx
0x0001c122 <+130>: add $0x38,%ebx
0x0001c125 <+133>: call 0x1b7b0 <__GI___mach_reply_port>
0x0001c12a <+138>: mov %eax,(%ebx)
0x0001c12c <+140>: jmp 0x1c0f8 <__GI___mig_get_reply_port+88>
0x0001c12e <+142>: xchg %ax,%ax
0x0001c130 <+144>: lea 0x1798(%esi),%eax
0x0001c136 <+150>: mov (%eax),%eax
0x0001c138 <+152>: jmp 0x1c0c4 <__GI___mig_get_reply_port+36>
0x0001c13a <+154>: lea 0x0(%esi),%esi
0x0001c140 <+160>: lea 0x1798(%esi),%eax
0x0001c146 <+166>: pop %ebx
0x0001c147 <+167>: pop %esi
0x0001c148 <+168>: pop %ebp
0x0001c149 <+169>: mov (%eax),%eax
0x0001c14b <+171>: ret
0x0001c14c <+172>: lea 0x0(%esi,%eiz,1),%esi
0x0001c150 <+176>: lea 0x1798(%esi),%ebx
0x0001c156 <+182>: jmp 0x1c125 <__GI___mig_get_reply_port+133>
End of assembler dump.
After the changes:
Dump of assembler code for function __GI___mig_get_reply_port:
0x00020060 <+0>: mov %gs:0x38,%eax
0x00020066 <+6>: test %eax,%eax
0x00020068 <+8>: je 0x20070 <__GI___mig_get_reply_port+16>
0x0002006a <+10>: ret
0x0002006b <+11>: lea 0x0(%esi,%eiz,1),%esi
0x0002006f <+15>: nop
0x00020070 <+16>: sub $0xc,%esp
0x00020073 <+19>: call 0x1f790 <__GI___mach_reply_port>
0x00020078 <+24>: mov %eax,%gs:0x38
0x0002007e <+30>: add $0xc,%esp
0x00020081 <+33>: ret
End of assembler dump.
I think that this is pretty nice :) Note that I didn't focus on optimizing
mig_get_reply_port () specifically, and also that the versions in ld.so and
in libc.a are more complex (but still nowhere near as complex as the original).
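For readers who prefer C to disassembly, the "after" version boils down to roughly the following shape. This is a rough sketch under stated assumptions, not the code from the patch: fake_port_t, tcb_reply_port and fake_mach_reply_port () are hypothetical stand-ins for mach_port_t, the reply_port slot in the TCB (accessed through the THREAD_*MEM macros in the real code) and __mach_reply_port ().

```c
#include <assert.h>

/* Hypothetical stand-ins, for illustration only.  */
typedef unsigned int fake_port_t;
#define FAKE_PORT_NULL 0u

/* Stands in for the reply_port slot in the thread's TCB.  */
static _Thread_local fake_port_t tcb_reply_port;

/* Counts how many ports the fake allocator has handed out.  */
static unsigned int ports_allocated;

/* Stands in for the kernel call that allocates a fresh reply port.  */
static fake_port_t
fake_mach_reply_port (void)
{
  return 100u + ports_allocated++;
}

/* Fast path: a single TLS load and a test.  Slow path: allocate a
   port and cache it in the TLS slot for subsequent calls.  */
static fake_port_t
sketch_get_reply_port (void)
{
  fake_port_t port = tcb_reply_port;
  if (port != FAKE_PORT_NULL)
    return port;

  port = fake_mach_reply_port ();
  assert (port != FAKE_PORT_NULL);
  tcb_reply_port = port;
  return port;
}
```

The point of the sketch is that, once the no-TLS checks are gone, nothing is left on the fast path except the load from the thread's own slot.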
Now, the horror story about __LIBC_NO_TLS () and __libc_tls_initialized:
Last time, when I realized that ld.so is lazily pulling object files out of
libc that something references, I understood that just putting the
__libc_tls_initialized into init-first.c would not work, for two reasons: for
one, that would cause ld.so to have its own local copy of
__libc_tls_initialized -- we wouldn't want that, we want libc.so and rtld to
have a consistent idea of whether or not the TLS is initialized. Secondly,
this would pull in init-first.o into rtld, which is very wrong, and in fact
init-first.c even contains code to intentionally cause a linking error if this
ever happens.
The latter could be solved by just declaring __libc_tls_initialized outside of
init-first.c, but the former, I thought, actually required defining it in
ldsodefs.h, to be renamed dl_tls_initialized and accessed using the GL()
macro. When I asked [1], nobody discouraged me from going this way.
[1]: https://sourceware.org/pipermail/libc-alpha/2023-March/146254.html
But in trying to implement that, I ran into trouble with including
<ldsodefs.h> in <tls.h>. Namely, it turned out that there's an inverse
dependency between these two headers already. <ldsodefs.h> was explicitly (and
needlessly) including <tls.h> -- that was easy to get rid of -- but also
implicitly depending on it in several ways. First, it includes <link.h> (for
struct link_map), and that needs <tls.h> to define FORCED_DYNAMIC_TLS_OFFSET
or something like that. Second, it needs to define some locks, so it includes
<libc-lock.h>, and that immediately needs <tls.h> for __libc_lock_owner_self.
Moreover, <libc-lock.h> includes <lowlevellock.h>, and that includes
<atomic.h>, and then <x86/atomic-machine.h> again includes <tls.h> for the
tcbhead_t definition.
So as you can see, there are quite a few ways that <ldsodefs.h> wants to
include <tls.h>! And naturally, including <ldsodefs.h> in <tls.h> then fails
inside <ldsodefs.h>, where it discovers that __rtld_lock_define_recursive is
not defined and so on.
So... I came up with three (!) different ways to work around that, before
coming up with the fourth one, as included in this patch set.
The first way I implemented this was with a pair of out-of-line functions,
__libc_no_tls () and __libc_set_tls_initialized (), whose implementation in a
separate file could freely include <ldsodefs.h>. __libc_no_tls () had to be
exported out of libc.so, @@GLIBC_PRIVATE, but other than that, it seemed to
work. But then I happened to take a look at the generated code and naturally
discovered that it didn't get LTO-inlined (and why would it, if I'm not
building with LTO -- nor would LTO work cross-DSO in any case). Doing an extra
function call (through PLT if we're talking about libpthread.so...) for a
function that is literally a load of a single byte sounded bad. Really, how
could I settle for bad code generated -- all because I couldn't figure out
some stupid header dependencies?
So for the second attempt, I forced the headers to work the way I needed them.
This involved some unpretty kludges; for instance here's how the kludge in
<tls.h> looked:
/* If we're not being included from inside (or after) these few headers,
   include ldsodefs.h for the GL macro.  Otherwise, those headers will
   include (or have already included) ldsodefs.h themselves.  This is done
   in this weird way because of issues with circular dependencies between
   these headers.  */
#if !defined (_LIBC_LOCK_H) && !defined (_MACH_LOWLEVELLOCK_H) \
&& !defined (_LOCK_INTERN_H) && !defined (_LINK_H) \
&& !defined (_X86_ATOMIC_MACHINE_H)
# include <ldsodefs.h>
#endif
Add to this more instances of #include <ldsodefs.h> (guarded by similar
include guards) scattered across various random headers, and -- it builds.
The generated code now was what I wanted it to be (a direct access to
GL(dl_tls_initialized)), but this was obviously not pretty or nice.
A much cleaner solution, I thought, would be to split the various headers
involved into more granular parts. For instance, <ldsodefs.h> only really
needs to declare the locks, but not to actually lock and unlock them. If we
moved the lock declaration macros to a new, smaller header, say,
<libc-lock-def.h>, then that would not need <tls.h>. A similar split would be
needed for <lowlevellock.h>. <tls.h> itself could be split: NPTL already has
a separate <tcb-access.h> where THREAD_{G,S}ETMEM macros are; but I was
imagining making it more granular still: for instance, we could have
<tls/tcb.h> which only defines the tcbhead_t layout (and that's what
<x86/atomic-machine.h> would include), <tls/access.h> which defines the
accessors, <tls/setup.h> which defines the various functions to set up TLS for
a thread, <tls/gscope.h> for the GSCOPE decls, and on the Hurd, <tls/no-tls.h>
for __LIBC_NO_TLS (). Each of the headers will be small and only bring in what
it really needs, and not everything and the kitchen sink.
This would be a clean and nice solution -- but it would be quite invasive,
and require changes in all the ports. And I have neither the hardware nor the
capacity to test that this breaks nothing on architectures I know nothing
about. (For instance, what's or1k or nios2? I've no idea.) And it's unlikely
that such a large, but poorly executed and tested reorganization would be
accepted simply because it would be convenient to the Hurd port.
Still, this seemed like the best way to pursue, so I half-implemented a
limited form of this (only splitting the locking headers). This was enough to
get x86_64-gnu to build without the include guard kludges. But then again, I
would need to do the same changes to NPTL, and without a way to really test
them, I didn't feel confident enough.
So that's when I had a small epiphany: TLS for the initial thread is always
set up inside rtld, before it passes control over to libc! There's no need to
share the flag with libc.so, since inside libc.so, it's always initialized!
We can just define __LIBC_NO_TLS () to 0 outside of rtld (in shared builds).
This instantly solves the issue of circular includes (no longer need to use
ldsodefs!), and also makes the generated code even smaller / more efficient
at runtime, since we can statically compile out the no-TLS branches.
This logic *might* be broken for ifunc resolvers (I don't know -- is it?),
but then apparently they're not able to use most of the normal libc
functionality anyway, so hopefully this is not a big deal.
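The idea of compiling out the no-TLS case can be sketched as follows. This is an illustration under assumptions, not the patch itself: SKETCH_SHARED, SKETCH_IN_RTLD, SKETCH_NO_TLS () and the two slots are hypothetical names (the real build distinguishes these cases with SHARED and IS_IN (rtld)), and the sketch pretends it is being compiled as part of libc.so.

```c
#include <assert.h>

/* Hypothetical configuration knobs for this sketch; here we pretend
   to be building libc.so (shared, not rtld).  */
#define SKETCH_SHARED 1
#define SKETCH_IN_RTLD 0

/* Only consulted in configurations where TLS may not be set up yet.  */
int sketch_tls_initialized;

#if SKETCH_SHARED && !SKETCH_IN_RTLD
/* rtld has already set up TLS before any of this code can run, so the
   check is a compile-time constant.  */
# define SKETCH_NO_TLS() 0
#else
# define SKETCH_NO_TLS() (!sketch_tls_initialized)
#endif

static _Thread_local int tls_slot = 7;
static int global_fallback_slot = -1;

/* Pick the per-thread slot, or a global fallback before TLS is up.
   With SKETCH_NO_TLS () defined to 0, the fallback branch is dead code
   that the compiler can drop entirely.  */
static int *
sketch_slot_address (void)
{
  if (SKETCH_NO_TLS ())
    return &global_fallback_slot;
  return &tls_slot;
}

/* Sanity check for the libc.so-like configuration selected above.  */
static void
sketch_self_check (void)
{
  assert (sketch_slot_address () == &tls_slot);
}
```

Flipping SKETCH_IN_RTLD to 1 selects the runtime check instead, which corresponds to the ld.so (and static) case where the flag still has to be consulted.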
So much about the TLS, let's finally jump to the
Conclusion
==========
So, yeah, this is "the rest of the x86_64-gnu port". Please do review, try to
build it, and try to run it if you can. And teach *me* to run it, if you know
how to. I have tested that i686-gnu still builds and works, but more testing
is needed.
Some things are still missing, for instance I haven't looked at implementing
{get,set,make,swap}context. It seems they aren't required for basic operation.
And naturally, once we start running/using this for real, we'll discover what
else is missing or broken.
I hope I didn't screw up the rebasing anywhere, but this is a pretty large
patchset, so I might have. If you see a commit that doesn't make sense, or
some "AMENDME" or "fixup" in the commit message or some such, please let me
know :)
I have also started a port of the Hurd proper to x86_64, but I am not sending
out the patches for that yet.
Sergey