ELF TLS ABI vs L4 ABI
Thu, 18 Nov 2004 13:26:18 +0100
ELF TLS ABI and the Hurd port to L4
Summary: The ELF TLS ABI and the L4 ABI overlap: both reserve a
register to access a thread pointer that points to a TCB. I propose
different strategies to resolve this conflict, and explain their
advantages and disadvantages.
You can find some background info below. For the sake of the experts,
I put the meat up front.
Solution 0: For all architectures (except ia32):
L4 can be changed to not need to reserve a register. The costs are
At startup, the thread will get its UTCB address in the register
currently defined by the L4 ABI. This ensures backwards compatibility
with the official L4 ABI. For ELF threads, the UTCB address is stored
away in the ELF TCB. Then the register is used according to the ELF
TLS ABI. The UTCB address can be accessed as quickly as any other ELF
TLS. A special version of libl4 for the ELF system provides the
official L4 API on top for ELF threads.
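The startup handoff described above can be sketched in C. This is only
an illustration under stated assumptions: the dedicated register is
simulated by the plain variable thread_reg, and struct elf_tcb,
elf_thread_start and l4_utcb are hypothetical names, not the glibc
TCB layout or the libl4 interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical ELF TCB layout; the real glibc layout differs. */
struct elf_tcb {
    struct elf_tcb *self;  /* the thread pointer points at the TCB itself */
    void *utcb;            /* L4 UTCB address, saved once at thread startup */
};

/* Stand-in for the dedicated thread register (simulated as a variable). */
uintptr_t thread_reg;

/* Runs first in a new ELF thread: thread_reg still holds the UTCB address,
 * exactly as the official L4 ABI hands it over. */
void elf_thread_start(struct elf_tcb *tcb)
{
    assert(tcb != NULL);
    tcb->self = tcb;
    tcb->utcb = (void *)thread_reg;  /* stash the UTCB address in the ELF TCB */
    thread_reg = (uintptr_t)tcb;     /* from now on: ELF TLS ABI usage */
}

/* What an ELF flavour of libl4 could do to find the UTCB afterwards. */
void *l4_utcb(void)
{
    return ((struct elf_tcb *)thread_reg)->utcb;
}
```

After elf_thread_start, the UTCB lookup is one load relative to the
thread pointer, which is the same cost as any other ELF TLS access.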
The changes to L4 are:
* The thread register must be saved and restored as any other normal
register (this is cheap).
* For local IPC to work, the kernel must be able to detect the
user-level thread switch. For this, either memory in the UTCB of
the thread can be used, or a r/w-able kernel page (per cpu) can be
used for that. This is still cheap.
* Some system call stubs need to access the UTCB in user space. For
these, I'd like to have different user space stubs which instead
take the UTCB address as an argument in a free register and can be
used by the ELF version of libl4. In fact, one could even have
those stubs access the UTCB in the ELF TLS directly, if one wants to
(but then they should definitely not be in the KIP, but maintained
in glibc). The ELF version of libl4 would then access those stubs
instead of the official ones.
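To make the last point concrete, here is a rough C sketch of the two
stub flavours. Everything in it is an assumption for illustration:
struct utcb with four message registers, the simulated register
l4_thread_reg, and the function names are invented and do not match
the real KIP stubs.

```c
#include <assert.h>
#include <stdint.h>

typedef uintptr_t word_t;

/* Hypothetical UTCB layout: just a few message registers. */
struct utcb {
    word_t mr[4];
};

/* Simulated dedicated register as defined by the L4 ABI. */
uintptr_t l4_thread_reg;

/* Official-style stub: finds the UTCB through the reserved register. */
void stub_store_mr(int i, word_t value)
{
    struct utcb *u = (struct utcb *)l4_thread_reg;
    assert(i >= 0 && i < 4);
    u->mr[i] = value;
}

/* ELF-style stub: the caller passes the UTCB address explicitly, so the
 * thread register stays free for the ELF TLS ABI. */
void elf_stub_store_mr(struct utcb *u, int i, word_t value)
{
    assert(i >= 0 && i < 4);
    u->mr[i] = value;
}
```

The ELF version of libl4 would fetch the UTCB address from the ELF TCB
and call the second flavour.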
To me this looks like a solution that can provide full support of the
ELF TLS ABI, with full backward compatibility, at minor overhead. Too
good to be true? Read on.
ia32: "You ignore ia32 at your own peril."
ia32 is problematic. Let me illustrate this the following way. Here
is a list of objectives:
[TS] Support the ELF TLS ABI, SUN variant.
(%gs:0 contains thread pointer)
[TG] Support the ELF TLS ABI, GNU variant.
(%gs is a 4GB segment with thread pointer as base address)
[US] Support limited user segments, for example for wine and libGL.
Limited user segments are those which are only allowed to cover
memory in the 0-3GB range.
[SS] Support small spaces (this has the optimization effect of tagged
TLBs on modern processors: it reduces the cost of context switches
to and from a subset of all tasks).
[CS] Have fast context switches. Reloading segment registers is slow.
So this means essentially: Do not reload segment registers if
[TS] is quite simple to support, by changing L4 to store the UTCB in
%gs:4, and let the user set %gs:0 to any value (which is saved and
restored by the kernel).
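A small C model of the [TS] scheme may help. It assumes a per-CPU area
behind the fixed %gs segment with the user-owned thread pointer at
offset 0 and the UTCB address at offset 4; struct gs_area, struct
thread and switch_thread are hypothetical names, not L4 kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simulated per-CPU area that the fixed %gs segment points to. */
struct gs_area {
    uint32_t thread_ptr;  /* %gs:0, set freely by user code (SUN variant) */
    uint32_t utcb;        /* %gs:4, maintained by the kernel */
};

/* Hypothetical kernel-side per-thread state. */
struct thread {
    uint32_t saved_thread_ptr;
    uint32_t utcb;
};

/* Thread switch: no segment register is reloaded, only two words change. */
void switch_thread(struct gs_area *gs, struct thread *from, struct thread *to)
{
    assert(gs != NULL && from != NULL && to != NULL);
    from->saved_thread_ptr = gs->thread_ptr;  /* save the user-set %gs:0 */
    gs->thread_ptr = to->saved_thread_ptr;    /* restore it for the next thread */
    gs->utcb = to->utcb;                      /* publish the next thread's UTCB */
}
```

The point of the sketch is that the segment descriptor never changes,
which is why [TS] stays compatible with [SS] and [CS].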
[TG] is the original objective (and in that case, we also have to
apply the same strategies as for the other architectures, described
above). Unfortunately, [TG] is mutually exclusive with [SS] and [CS].
[SS] Small spaces rely on segment limits for protection, which a 4GB
%gs segment as required by [TG] would defeat. Considering that
applications for small spaces include performance-critical parts
like the user-space scheduler and irq handlers/drivers
(fast-ethernet?), we are sceptical whether the ELF TLS optimization
of [TG] over [TS] is worth giving them up.
[CS] Fast context switches are particularly important for IPC.
Currently, the L4 context switch is heavily optimized. Adding "tons
of cycles" to every context switch, even for IPC, can have a huge
impact. We can at least control this by only incurring this cost when
going to or from a task that is using a user segment (i.e., all Hurd
tasks, but not device drivers, etc). So, maybe, this point is only
secondary.
[US] is a feature that has only limited application - there may be
other ways to support emulation, and nothing will make the libGL
people happy ;). I added it to show what is at stake. Of course, if
[TG] is supported, supporting [US] is a no-brainer. Supporting [US]
can be done in a controlled way, without much effect on context
switches between non-[US] tasks (you only pay the cost when switching
to or from a task using user segments).
So, here are two possible solutions:
1. [TG], ![SS], ![CS] for [TG] and [US] threads, [US] trivially
2. [TS], [SS], [CS] (except for [US] threads), [US] if needed
Note that [TS] is ELF TLS, too, just not the optimized GNU variant.
We think that the potential difference is so large that it is worth
eventually implementing both methods and measuring the difference.
Once everything is done, the actual numbers will probably give us a
basis for making a smart decision.
For all architectures:
The ELF TLS ABI defines how ELF objects (static and dynamic objects)
can define and use data that is allocated per thread. The support for
this is spread all over the toolchain. gcc supports a new keyword
__thread that can be used to declare a variable thread-local. The
thread library in glibc allocates and manages the storage, and the
linker supports the new sections in the ELF format for this.
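As a reminder of what this looks like from the programmer's side, here
is a minimal sketch using gcc's __thread keyword with POSIX threads.
The names counter, worker and run_demo are made up for this example.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* One instance of this variable exists per thread. */
__thread int counter;

/* Writes only the worker thread's own copy of counter. */
void *worker(void *arg)
{
    (void)arg;
    counter = 100;
    return NULL;
}

/* Returns the calling thread's counter after a second thread has
 * written its own copy, or -1 if the thread could not be created. */
int run_demo(void)
{
    pthread_t t;

    counter = 1;
    if (pthread_create(&t, NULL, worker, NULL) != 0)
        return -1;
    pthread_join(t, NULL);
    assert(counter == 1);  /* the worker's store did not touch our copy */
    return counter;
}
```

Depending on the glibc version, this may need to be built with
-pthread.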
For this discussion, only the way to obtain the thread pointer to the
ELF TCB is of relevance. For optimization, access to the thread
pointer is usually inlined, if possible. This means that once you
decide how and where the thread pointer is stored (usually in a
dedicated register), gcc will generate code to use it, and handwritten
assembler in glibc will use it, too. So, changing the way to get at
the thread pointer is potentially painful. How to get the thread
pointer is architecture specific (see below), and also defined by the
ELF TLS ABI.
The L4 ABI defines how the user thread can get at the address of its
user thread control block (UTCB). The UTCB is used to access various
virtual registers and to implement some parts of system calls in user
space. For each architecture, a dedicated register is reserved for
A particularly interesting use of the UTCB register is local IPC.
Local IPC can be implemented in user space by a user-level thread
switch. If you don't return from the local IPC within your timeslice,
the thread will be preempted. The kernel will notice that the UTCB
address is not the one it expected it to be, and will fix-up the state
of the threads to make the IPC that happened visible to the kernel
data structures and the other threads in the system. So, the UTCB
register is not only used to communicate the UTCB address from the
kernel to the user, but also the other way round.
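The detection step can be modelled in a few lines of C. This is a
sketch under assumptions: struct cpu_state and preempt_check are
hypothetical, and the actual fix-up of kernel thread state is reduced
to a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical kernel view of one CPU. */
struct cpu_state {
    uintptr_t expected_utcb;  /* UTCB of the thread the kernel scheduled last */
    unsigned fixups;          /* number of user-level switches reconciled */
};

/* Called on preemption with the UTCB register value found in the user
 * state. Returns true if a user-level thread switch had to be fixed up. */
bool preempt_check(struct cpu_state *cpu, uintptr_t utcb_reg)
{
    assert(cpu != NULL);
    if (utcb_reg == cpu->expected_utcb)
        return false;  /* no local IPC happened behind the kernel's back */

    /* A real kernel would now update both threads' state so that the
     * IPC becomes visible; here we only record the new current thread. */
    cpu->expected_utcb = utcb_reg;
    cpu->fixups++;
    return true;
}
```

If the register is handed over to the ELF TLS ABI, the kernel needs
another place to read utcb_reg from, which is the reason for the
UTCB memory or per-CPU kernel page mentioned under Solution 0.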
ia32 is different in that it has almost no registers. This is
recognized by everyone. Consequently, neither ELF nor L4 uses a
general purpose register. Both use the %gs segment register.
ELF comes in two variants here, the SUN variant and the GNU variant
(by Drepper). The SUN variant expects the thread pointer to reside in
%gs:0, and does not have any further requirements on the %gs segment
selector or descriptor. The GNU variant expects that the segment
descriptor selected by %gs has the TCB address as the base address of
the segment, and that the limit of the segment is set to the maximum
of 4GB (minus one). This is then exploited in accessing data before
and after the actual TCB address with positive and negative offsets
(%gs:offset). Negative offsets wrap around correctly only if the
segment limit is set to maximum. This allows for very efficient TLS
access (single instruction access at best).
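The wrap-around argument can be checked with a little arithmetic. The
following C sketch models a simplified 32-bit expand-up data segment;
struct segment and seg_access are invented for this illustration, and
the base and limit values in the usage note are arbitrary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified 32-bit expand-up data segment. */
struct segment {
    uint32_t base;
    uint32_t limit;  /* highest valid offset */
};

/* Models a 4-byte access at seg:offset. Returns false where the
 * hardware would raise a fault because of the limit check. */
bool seg_access(const struct segment *s, int32_t offset, uint32_t *linear)
{
    uint32_t off = (uint32_t)offset;  /* -8 becomes 0xFFFFFFF8 */

    assert(s != NULL && linear != NULL);
    if (off > s->limit || s->limit - off < 3)
        return false;                 /* one of the 4 bytes is past the limit */
    *linear = s->base + off;          /* wraps modulo 2^32 */
    return true;
}
```

With base 0x08100000 and limit 0xFFFFFFFF, offset -8 passes the limit
check and wraps to linear address 0x080FFFF8, just below the TCB. With
any smaller limit, for example 0x00FFFFFF, the same access fails the
check, while positive offsets within the limit still work.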
L4 uses the %gs segment selector, too, but the segment descriptor is
used globally and never changes. It points to a per-CPU data
structure that contains the UTCB address. At thread switch, only the
word at %gs:0 is updated.
L4 can be configured on ia32 to support small address spaces. These
are mapped into all other address spaces, like the kernel memory, but
cannot be protected via paging. In this configuration, L4 uses a
segmented memory model. The small address space concept stands in for
the tagged TLBs of modern processors, and reduces the cost of context
switches due to TLB flushes.
For other architectures:
For some architectures, the ELF TLS ABI and the L4 ABI can co-exist
(ia64). For several architectures, they use exactly the same general