RE: [Bug-gnubg] Benchmarks, experiments, speedups
Sat, 26 Feb 2005 17:48:48 +0100
I dived a little deeper into gnubg NN speed issues. The experiment was as
follows. I started with the template rewrite code I have mentioned before
(attachment listings 8 and 9).
> 1) Compiler
> The SSE and SSE2 extensions to the x86 processor set
> introduce a limited SIMD (single instruction multiple data)
> capability to more recent Pentiums and AMDs. The bottomline
The compilers were gcc 3.4.1 as per the current "experimental" binary
distribution of cygwin, and Intel icc 8.1.24 C++ for Windows. The test case
was lines 495 to 504, with particular interest in the inner loop, lines 500
and 501 (attachment listing 1). In the first experiment I replaced the
Evaluation and EvaluationFromBase routines with the template code
(attachment listings 8 and 9).
I've produced 6 binaries:
icc with templates optimized for SSE2 [ICCTSSE2]
icc without templates optimized for SSE2 [ICCNTSSE2]
icc with templates optimized for 386 [ICCT386]
icc without templates optimized for 386 [ICCNT386]
gcc with templates optimized for SSE2 [GCCTSSE2]
gcc without templates optimized for SSE2 [GCCNTSSE2]
The optimizations can be seen in the attachment, listings 10 and 11. For the
binaries, inlining was active; for producing assembly listings it was
turned off (otherwise it's almost impossible to understand the code).
Two matches with many moves (and thus a good mix of positions) were selected
as benchmarks:
[7PT] A 7pt match, result 6-8, Player 1 235 moves, Player 2 232 moves in 12 games
[11PT] An 11pt match, result 11-9, Player 1 222 moves, Player 2 219 moves in 11 games
[SEVALS] static evaluations/s as displayed by calibrate (a coarse estimate)
Here are the timings on my P4C @ 2.4 GHz (+4% overclocking, so effectively
2496 MHz); tests for ICC were run in GUI mode and for GCC in no-GUI mode:
[7PT][ICCTSSE2] 5:45 = 345 s
[7PT][ICCNTSSE2] 5:58 = 358 s
[7PT][ICCT386] 6:18 = 378 s
[7PT][ICCNT386] 6:32 = 392 s
[7PT][GCCTSSE2] (not taken)
[7PT][GCCNTSSE2] (not taken)
[11PT][ICCTSSE2] 5:55 = 355 s
[11PT][ICCNTSSE2] 5:58 = 358 s
[11PT][ICCT386] 6:32 = 392 s
[11PT][ICCNT386] 6:29 = 389 s
[11PT][GCCTSSE2] 7:04 = 424 s
[11PT][GCCNTSSE2] 7:12 = 432 s
[SEVALS][ICCTSSE2] 29.250 e/s
[SEVALS][ICCNTSSE2] 31.800 e/s
[SEVALS][ICCT386] 24.200 e/s
[SEVALS][ICCNT386] 23.600 e/s
[SEVALS][GCCTSSE2] 23.000 e/s
[SEVALS][GCCNTSSE2] 22.800 e/s
Thus the results are:
1. The template code has little or no effect on the e/s display; in the case
[ICCTSSE2] it actually went down.
2. The template code has no negative effect in any test, but also little
positive effect.
3. The two cases where tests were made faster are [7PT][ICCTSSE2] (3.7%) and
[11PT][ICCTSSE2] (1%).
4. The real-world test results contradict the eval/s display in some cases,
[7PT][ICCTSSE2] in particular.
The reason for 4., I found later, is that eval/s does not exercise all nets
equally but only a certain profile. So the overall overhead (from the switch
statement) dominates the real performance (i.e. when the nets are used mixed,
versus the profile of eval/s). So my conclusion is that the template approach,
given its ugliness and minor effect, is most likely not a candidate for the
CVS.
Next I asked myself:
1) Why is there so little effect?
2) Why is gcc so desperately bad compared to icc?
It turned out that both questions have roughly the same answer.
The original idea was that a compiler could apply better optimization to
loops if the length is statically known at compile time. So I looked at the
code generated by gcc and icc for the loop from listing 1. It's in listings
2 and 3, with some comments of mine.
To understand the following, you need to know that most SSE instructions
come in 2 flavours:
(1) working either on only the lowest of the four 32-bit words of a 128-bit
register ("ss" versions) or on all 4 in parallel ("ps" versions). The
execution time for "ps" and "ss" is exactly the same.
(2) working on 16-byte aligned data or working on unaligned data. Aligned
execution is more than twice as fast; the reason lies in the CPU's cache
architecture. For unaligned access, the words of a single 128-bit operand
may be spread over 2 cache lines in all cache levels (L1, L2), so the
hardware has to do a lot more checking.
GCC: The code generated by gcc is poor at best. It uses SSE instructions,
but not in parallel mode. Furthermore it uses them unaligned, i.e. does not
care about 16-byte alignment. So it chooses SSE instructions, but handles
them as if they were 387 instructions: it exploits neither the parallelism
nor the memory architecture. Another nastiness is the fact that gcc
transfers interim results from the inner, hot loop body to a memory location
on each iteration. There is absolutely no need to do so; working in
registers would be fine as long as the result is transferred at the end. To
make things worse, gcc re-reads the memory location. This double nonsense is
a definite performance killer.
ICC: The code generated by icc is hard to beat even for hand-written
assembler. A loop is done in several stages:
[stage 1] Loop length == 0 => skip loop
[stage 2] Loop length < 11 => don't bother optimizing, do it unaligned/ss
[stage 3] Data is 16-byte aligned => use an aligned, 2-fold unrolled loop
body
[stage 4] Data is not 16-byte aligned:
[stage 4a] Check the length of the loop. If it's worth the effort, do a
little pre-loop until an aligned memory position is reached, then go on as
in stage 3
[stage 4b] If it's not worth the effort, do an unaligned/ps version of the
loop
[stage 5] Clean-up loop. This handles the trailing elements (i.e. loop
length % 4 iterations) left over by the parallel loops (which, being 2-fold
unrolled, run loop length / (4*2) iterations)
[stage 6] Transfer the result from registers to memory
So it's clear now why gcc is outperformed by 1/3 compared to icc: gcc simply
produces poor code that does not really exploit the SSE and SSE2 instruction
sets, but makes a half-hearted attempt to at least use SSE rather than 387
code. That pays off a little, because even the "ss"/unaligned instructions
are faster than 387.
The question why the template patch did not improve performance much with
icc was answered by a look at the assembly generated by icc for [ICCNTSSE2]
and [ICCTSSE2]. Only the smallest loop (case 5,5) was unrolled completely;
the other code was unchanged. With the knowledge of the pretty complicated
alignment pre-loop, that makes sense: unrolling may cause execution on
unaligned addresses and so actually decrease throughput. For the (5,5) loop
this is not relevant; it falls below the threshold of [stage 2] anyway and
thus is always done in [stage 4b].
So why was performance increased at all? Mainly because of the BETA* float
constants, which were now inlined (BETAHIDDEN) or optimized away entirely
(BETAOUTPUT == 1.0f). This saves some multiplications. Together with the
unrolling of the race net code, this seems to fit the 1% and 3.7%. However,
no template magic is needed to apply these optimizations; a simple #define
BETAHIDDEN will do. Finally, replacing the (5,5) loop with unrolled code
removes the fancy analysis applied to general loops, saving significant
time. This explains why icc profited more than gcc as well.
> 2) Loops
> SSE2 vectorizes exactly 4 numbers and thus can only work on
> loops of a length divisible by 4. Furthermore, it is
> required that the exact loop length is known at compile
> time. In neuralnet.c, we have most loops in this style:
> for( i = 0; i < pnn->cHidden; i++ )
> ar[ i ] = pnn->arHiddenThreshold[ i ];
The next question was: how can gcc performance be improved? First, it is the
release compiler, so optimizations will pay off. Second, there is not too
much activity in the gcc community to improve SSE performance; they spend
their time mostly on optimizations for Apple OS X / G5 / Altivec. Doh!
The times of inline assembly seem to be over; all compilers (including gcc)
nowadays generate close-to-perfect code for control flow, i.e. loops and
branches. The issue with gcc is not there, but in the number-crunching code.
So I looked into that and found intrinsics, which easily replace inline
assembler. For a set of related URLs, see the attachment.
I've produced 2 intrinsic versions of the inner loop from listing 1, one
using GCC syntax (listing 4 (C) and 5 (ASM output)) and one using icc syntax
(listing 6 (C) and 7 (ASM output)). Due to time and bug constraints, I was
not able to run the gcc version, so it may well be incorrect. The icc
version worked, but actually decreased performance. I'll have to look into
this in future work. Nevertheless, here are two intrinsic examples, and I
think they will work miracles for gcc/SSE.
So what is worth knowing if we decided to code intrinsics for gcc?
1) All commands needed for "float" as compared to "double" arithmetic are in
SSE already. The only addition useful for gnubg code in SSE2 would be data
type conversion (i.e. int to float). The other commands deal with double
precision arithmetic, which gnubg does not use. So we should produce SSE
code rather than SSE2 code, which will run on a much wider base of CPUs
(e.g. including AMD) too.
2) In the early stages of development, gcc used Intel syntax for its
intrinsics, but later gcc switched to a naming scheme of its own. The
intrinsics still map 1:1 apart from the naming, and the fact that Intel
offers a few more, which are macros (i.e. combinations of several SSE
instructions). So if intrinsics are coded for gnubg, it seems appropriate
to use our own names so we can #define the compile-time appearance for both
Intel and gcc.
3) This is the main issue. Gnubg's data is not 16-byte aligned at all, so at
least for Intel, intrinsics will be slower than the smart loops with
alignment adjusting. On both targets, unaligned loops give away 100%
performance, so it would basically not be worth the effort. To align the
relevant data structures, two things are needed:
3.1) All parameters passed to NN methods need to be 16-byte aligned using
"__declspec(align(16))" in their declaration. This is Intel syntax, but I'm
sure gcc has something similar.
3.2) All memory allocated must be 16-byte aligned as well; for Intel that
requires using the "_mm_malloc(size, alignment)" call. To make things worse,
memory allocated this way MUST be freed by "_mm_free(addr)" rather than
normal "free(addr)", or the effects are undefined. We would probably need
our own routine for aligned memory allocation here, to avoid depending on a
specific compiler.
4) The "__declspec(align(16))" directive can be applied to variables or
typedefs. Unfortunately, the arrays passed around in eval.c and neuralnet.c
are mere arrays (e.g. "float *arOutput"); no centralized typedefs are
available. This makes it a cumbersome one-by-one effort to align them all,
and almost for sure some will be missed. So on the way towards SSE, it
would be of great value to introduce typedefs for the arrays, which would
improve the readability and safety of the code anyway.
That's all folks, for now. I want to see 40.000 eval/s not too soon ;)