Re: [MIT-Scheme-devel] Rework amd64 ABI -- flag day

On Sat, Jan 5, 2019 at 12:16 AM Taylor R Campbell <address@hidden> wrote:

Summary:

I would like to merge some incompatible changes to the amd64 compiled
code ABI that substantially improve system performance all around,
especially for closure-heavy code. It shaves about 20% off the test
suite time (with FAST=y); some workloads run 5x or 10x faster. The
branch is riastradh-20181220-closentry-v2.

But...this will require a flag day -- you'll have to cross-compile a
new system from an old one, which is a pain, though less of a pain now
that cross-fasdump-v2 is merged.

When would be a good time to do this?

Full story:

Back when I wrote the amd64 back end a decade ago, I made it work, but
I ran out of steam to make it work well, and then we all started using
it so it was a pain to change, so there were some bad decisions
embedded in it.

For example, the x86 representation of closures -- with embedded CALL
instructions to jump to the real code and push the pointer to the data
on the stack -- might have made sense in 1990, but it's about the
worst way imaginable to abuse modern CPUs:

- consing a closure entails writing instruction memory that the first
invocation will have to load into the instruction cache (and reload
after every GC),
- the CALL is never paired with a RET so it breaks return address
branch target prediction,
- and to top it off nearly every unknown procedure call shares a
single JMP instruction (per arity) in cmpauxmd.s (x86-64.m4) so the
CPU can't discriminate to refine indirect branch target prediction.

Recently this was bugging me, so I took some time to fix some of the
mistakes I'd made and confirm they give various workloads substantial
performance improvements.

Merging all these changes will require a flag day: we'll have to build
a new system with a cross-compiler, and old coms won't work. The
cross-fasdump-v2 branch will make that process less painful, but it's
still a flag day.

Here's the changes I'm sitting on that require a flag day:

- Split the compiled-entry type code into compiled-entry and
compiled-return, so that calling conventions for heap closures and
stack continuations don't compete on engineering constraints.

Either tag can still refer to any kind of compiled code:
expressions, continuations, open or global procedures; this is
convenient for, e.g., interrupt checks, where we may need to restart
a procedure from the top or return to a continuation, and convenient.

This change affects all architectures, though only amd64 takes
advantage of it at the moment by choosing the following
representations:

. Compiled-entry is a pointer to an _offset_ to instructions.

=> This way a compiled closure is just an array of offsets, and
consing a closure entails writing no new instruction memory and
invoking a newly consed closure entails no instruction cache
reloads.

=> This makes a big difference for closure-heavy code -- 5x speed
improvements in some of my microbenchmarks. And execute caches
(`uuo-links') can optimize away the indirection for anything
except closures (and maybe could optimize those away too with
some more help from the GC).

=> There is a cost: unknown calls to fixed, long-lived, often-used
closures or to open procedures are marginally costlier because
of the additional memory indirection, but I haven't measured
this to be over 5% even in microbenchmarks that are nothing
_but_ these cases. So I think this tradeoff is well worth it.

. Compiled-return is a pointer to a _jump instruction_.

=> This way we can use CALL to push a continuation, and a paired
RET to return to it, which means we can take advantage of the
CPU's return address branch target predictor.

=> The jump instruction has to go to a compiled entry with zero
offset so we can still parse continuations just like before.
It'll look like this:

;; Procedure call.
(CALL (@PCR pushed))
(JMP (@PCR continuation)) ; Return address points here.
pushed:
(OR (@R ,rsp) (& ,tag)) ; Tag the continuation.
(PUSH Q (& 123)) ; Push additional arguments.
...
(JMP (@PCR uuo-link)) ; Call procedure.
(WORD ...) ; Continuation entry metadata.
(BLOCK-OFFSET continuation)
(QUAD U 0) ; Zero offset to instructions.
continuation:
(PUSH Q (R 0)) ; Start of continuation code.

;; Procedure return (ignoring interrupt checks).
(AND Q (@R ,rsp) (R ,rbp)) ; Detag continuation.
(RET) ; Pop and return.

- For INVOCATION:APPLY unknown procedure calls, call an assembly hook
subroutine to set up an application, but generate a JMP instruction:

(POP Q (R ,rbx)) ; Pop procedure into rbx.
(CALL (@RO ,regs #x128)) ; Call assembly hook.
;; The assembly hook sets rax to the address of the
;; procedure's starting instruction if it can, or to another
;; assembly hook that defers to the interpreter if it can't.
(JMP (R ,rax))

This way the CPU can train each call site with a different branch
target prediction, rather than merging them all in an assembly hook
that executes the JMP itself. Gave about 2x speed increase on some
microbenchmarks.

- Tweak some other compiler utilities to use CALL/RET responsibly.
Interrupt routines don't because they're always going to wreck the
return address branch prediction stack by returning to microcode
anyway, and it's more convenient for them to have the entry address.
On architectures where compiled entries and compiled returns have
the same representation this makes no difference.

- Use rax, not a slot in the registers block in memory, as the return
value register -- but still let the machine register allocator hand
out rax.

. Requires some changes to the RTL generator to make sure we read
the value register first thing in a continuation, or write it last
thing before a return.
=> As it happens, there was already a bug here with interpreter
call results (which go in rax too) and procedures with dynamic
links.
=> This is just a matter of inserting some temporaries, which
evaporate into aliases for the RTL register allocator.

. Requires some changes to the RTL optimizer to avoid propagating
available machine registers in CSE and code compression so that
the read and write stay at the beginning or end of a block.

. Requires some changes to the LAP generator to avoid offering rax
for allocation when it is the target of the RTL instruction we are
working on right now.

. Requires some minor changes to the assembly hooks to avoid
clobbering the return value, e.g. trampolines now use r9b instead
of al for the trampoline code.

- Open-code with-interrupt-mask and with-interrupts-reduced (in the
LAP generator, not in the RTL generator) so that we don't have to go
through reflect-to-interface, which breaks the return address branch
target predictor. This gave about a 2x performance improvement on
promise operations (after I sped them up a lot by removing one of
the without-interruptses already).

My branch seems to be pretty stable now and I'm satisfied with all the
ABI changes (much as it might be cute -- and not too hard now with the
portable fasdumper -- to also try migrating to a NaN-tagging scheme),
so I'd like to find a convenient time to merge it. Thoughts?

(Review also welcome before I merge!)

_______________________________________________
MIT-Scheme-devel mailing list
address@hidden
https://lists.gnu.org/mailman/listinfo/mit-scheme-devel

From:	Chris Hanson
Subject:	Re: [MIT-Scheme-devel] Rework amd64 ABI -- flag day
Date:	Sat, 5 Jan 2019 17:59:40 -0800