The final result for amd64 looks like:
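  static __inline__ long
  FT_MulFix_x86_64( long  a,
                    long  b )
  {
    register long  result;

    __asm__ __volatile__ (
      "movq  %1, %%rax\n"       /* rax = a                             */
      "imulq %2\n"              /* rdx:rax = rax * b                   */
      "addq  %%rdx, %%rax\n"    /* rdx is 0 or -1 if the product fits  */
      "addq  $0x8000, %%rax\n"  /* round to nearest                    */
      "sarq  $16, %%rax\n"      /* drop the 16 fraction bits           */
      : "=&a"(result)           /* early clobber: rax is written first */
      : "rm"(a), "rm"(b)        /* one-operand imul takes no immediate */
      : "rdx", "cc" );
    return result;
  }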
The use of long, though, requires review. The C version uses FT_Long
(not FT_Int32 like the other asm versions), but FT_Long is not a #define
or a typedef at the point where the asm versions are located.
That said, using long there on amd64 prevents unnecessary 32<->64 bit
conversions in the resulting code.
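
Part of what needs reviewing is that long is only 64 bits wide on LP64
targets; on an LLP64 target it is 32 bits and the movq/sarq operands
above would not match. A minimal sketch of a guard, assuming GCC-style
predefined macros. MULFIX_X86_64_OK and mulfix_check_long are
hypothetical names, not anything FreeType defines:

```c
#include <assert.h>
#include <limits.h>

/* MULFIX_X86_64_OK is a hypothetical macro name, not a FreeType one. */
#if defined( __GNUC__ ) && defined( __x86_64__ ) && LONG_MAX > 0x7FFFFFFFL
#define MULFIX_X86_64_OK  1   /* LP64: long fills a 64-bit register   */
#else
#define MULFIX_X86_64_OK  0   /* e.g. LLP64, where long is 32 bits    */
#endif

/* Hypothetical runtime double-check of the preprocessor test above. */
static void
mulfix_check_long( void )
{
    assert( !MULFIX_X86_64_OK || sizeof ( long ) * CHAR_BIT == 64 );
}
```

The asm version would then only be compiled when MULFIX_X86_64_OK is 1.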
The above code has a latency of 1+5+1+1+1 = 10 clocks on an amdfam10 CPU.
The assembly generated from the C code is 45 lines and 158 octets long,
contains six conditional jumps and three each of explicit compares and
tests, and yet it benchmarks just as fast. Out-of-order processing
wins out over hand-coded asm. :-/
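
For comparing the two approaches, here is a sketch in plain C of what
each one computes. It is not the FreeType source: the function names are
hypothetical, the inputs are assumed to fit in 32 bits (so the product
fits in 64), and >> on a negative value is assumed to be an arithmetic
shift, as it is with GCC. The first function rounds the magnitude and
restores the sign, which is where branches come from; the second mirrors
the five asm instructions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper names; none of these are part of FreeType. */

/* Sign-and-magnitude form: round |a*b| to nearest, then restore sign. */
static int64_t
mulfix_signmag( int64_t  a,
                int64_t  b )
{
    int       neg = ( a < 0 ) != ( b < 0 );
    uint64_t  ua  = a < 0 ? (uint64_t)( -a ) : (uint64_t)a;
    uint64_t  ub  = b < 0 ? (uint64_t)( -b ) : (uint64_t)b;
    int64_t   c   = (int64_t)( ( ua * ub + 0x8000U ) >> 16 );

    return neg ? -c : c;
}

/* Branch-free form mirroring the asm sequence. */
static int64_t
mulfix_branchfree( int64_t  a,
                   int64_t  b )
{
    int64_t  lo = a * b;            /* low half of the product (rax)   */
    int64_t  hi = lo < 0 ? -1 : 0;  /* high half (rdx) when it fits    */

    /* Assumes an arithmetic right shift, like sarq. */
    return ( lo + hi + 0x8000 ) >> 16;
}

/* Compare both forms on a few 16.16 values; returns pairs checked. */
static int
mulfix_selfcheck( void )
{
    static const int64_t  v[] = { 0, 1, -1, 0x7FFF, 0x8000, -0x8000,
                                  0x10000, -0x10000, 0x18000, -0x12345,
                                  0x7FFFFFFF, -0x7FFFFFFF };
    size_t  n = sizeof ( v ) / sizeof ( v[0] );
    size_t  i, j;
    int     checked = 0;

    for ( i = 0; i < n; i++ )
        for ( j = 0; j < n; j++ )
        {
            assert( mulfix_signmag( v[i], v[j] ) ==
                    mulfix_branchfree( v[i], v[j] ) );
            checked++;
        }

    return checked;
}
```

The extra add of the high half is what keeps rounding symmetric around
zero: without it, -1 times 0x8000 would round to 0 while 1 times 0x8000
rounds to 1.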
It *might* make more of a difference on an in-order processor like the
Atom, but I do not have one to test.
I can still finish a patch, and have collected the info I need to do one
for mips64, too, where I expect it will be more important. I also expect
that the i386 version could be tidied a bit.
Is the amd64 version desired, given how little benefit it has?
-JimC