Re: [lmi] Using auto-vectorization


From: Greg Chicares
Subject: Re: [lmi] Using auto-vectorization
Date: Tue, 24 Jan 2017 02:26:12 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Icedove/45.6.0

On 2017-01-23 21:26, Vadim Zeitlin wrote:
[...with '-O3'...]
>         Time (seconds) for array0 = array1 + array2 by various methods
>  length          C             et       et/C         va       va/C
>       1      5.218e-08      5.057e-08  0.969      4.714e-08  0.903
>      10      5.515e-08      5.191e-08  0.941      4.702e-08  0.853
>      20      5.105e-08      5.043e-08  0.988      4.645e-08  0.910
>      50      6.275e-08      6.856e-08  1.093      5.653e-08  0.901
>     100      7.027e-08      7.475e-08  1.064      7.224e-08  1.028
>    1000      3.537e-07      3.505e-07  0.991      3.404e-07  0.962
>   10000      4.932e-06      4.824e-06  0.978      4.878e-06  0.989
>  100000      6.623e-05      6.552e-05  0.989      6.634e-05  1.002
[...]
>   i.e. it operates on 64 bits at once. Enabling the use of AVX (present in
>   all Intel CPUs since Sandy Bridge) replaces it with the following
[...]
>   which now operates on 128 bits at once. While it doesn't make the loop
>   twice as fast, it still results in noticeable gains starting from N as
>   small as 20, but is significantly slower for N >= 10000:
> 
>         Time (seconds) for array0 = array1 + array2 by various methods
>  length          C             et       et/C         va       va/C
>       1      5.399e-08      5.068e-08  0.939      4.935e-08  0.914
>      10      4.972e-08      4.837e-08  0.973      5.022e-08  1.010
>      20      4.504e-08      4.430e-08  0.984      4.949e-08  1.099
>      50      4.780e-08      4.667e-08  0.976      5.324e-08  1.114
>     100      5.332e-08      5.256e-08  0.986      6.280e-08  1.178
>    1000      2.152e-07      2.076e-07  0.965      2.845e-07  1.322
>   10000      5.658e-06      5.525e-06  0.976      5.509e-06  0.974
>  100000      7.167e-05      6.636e-05  0.926      6.669e-05  0.930

Just to make absolutely sure I understand this, I think you're comparing
the two preceding tables, e.g.:

-O3:  100      7.027e-08      7.475e-08  1.064      7.224e-08  1.028
AVX:  100      5.332e-08      5.256e-08  0.986      6.280e-08  1.178
               faster         faster                faster

-O3 10000      4.932e-06      4.824e-06  0.978      4.878e-06  0.989
AVX 10000      5.658e-06      5.525e-06  0.976      5.509e-06  0.974
               slower         slower                slower

(After spending so many years studying financial numbers, I've gotten
into the habit of focusing on the last column in any table. It just
so happens that the last column in the table immediately above
significantly exceeds unity starting around N=20, yet falls below
unity for N=10000 and above. I.e., for N in [50,1000], the measured
time is better, but the "va/C" column is worse. Your characterization
of the numbers presented makes sense when I remind myself that you're
a scientist and a mathematician, not an accountant. So this is indeed
puzzling.)

In this 'vector_test', maybe the size of the CPU cache matters. N is
not just the number of times we go through a loop: it's the length of
the data structure.
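
For a rough sense of scale (my own back-of-the-envelope figures,
assuming eight-byte doubles; actual cache sizes vary by CPU): the
three arrays together occupy 3 * N * 8 bytes, i.e., about 2.4 KB at
N=100 but 2.4 MB at N=100000. The former fits comfortably in L1;
the latter overflows a typical L2 cache, so the largest-N timings
may be dominated by memory bandwidth rather than by the arithmetic
we mean to measure.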

>  I am very puzzled by the results of these benchmarks, and I wonder
> whether there is a problem in the benchmarking methodology itself, because
> I just don't see how else to explain this. I am going to try doing two
> things in order to clear it up, unless you think it's a needless waste of
> time: first, I'd like to change TimeAnAliquot() to also calculate the
> standard deviation of the times, and maybe use it to run the test more
> times in order to lower that deviation. And second, I'm going to rerun
> these tests on another machine, to check whether the results are
> reproducible.

"first": TimeAnAliquot() returns the mean, and whenever I study its
results deeply I wish it returned a more robust statistic such as
the median or mode, but those would require storing the time spent
on each iteration, and the resulting overhead might perturb timings
of very simple operations like the N=1 tests here; however, those
reservations make less sense now that we have excess RAM than they
did at the beginning of the lmi epoch. See:

http://stackoverflow.com/questions/10930732/c-efficiently-calculating-a-running-median
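
For instance, a minimal sketch of the store-and-select idea
(median_time() is a hypothetical helper, not what TimeAnAliquot()
actually does; it assumes the per-iteration times have already been
collected into a vector):

  #include <algorithm>
  #include <vector>

  // Median of stored per-iteration timings.
  // Precondition: !times.empty(). Reorders its argument.
  double median_time(std::vector<double>& times)
  {
      auto const n = times.size();
      auto const mid = times.begin() + n / 2;
      std::nth_element(times.begin(), mid, times.end());
      if(n % 2 != 0)
          return *mid;
      // Even count: the lower middle is the largest element
      // left of 'mid' after partitioning.
      return (*std::max_element(times.begin(), mid) + *mid) / 2.0;
  }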

> GC> > GC> so that we can measure what you'd prefer
> GC> 
> GC> I'm guessing that you'll prefer lambdas. If you like, you could just give
> GC> me your implementation parallel to mete_c() in each of those unit tests,
> GC> and I'll add it. Or would your preferred implementation simply replace
> GC> the classic for-loops with ranged ones? And, if so, could you show me how
> GC> you'd write mete_cxx11_typical() to parallel the other mete_*_typical(),
> GC> if that's a better test of what you have in mind?
> 
>  The test results seem to invalidate what I had in mind, namely just
> writing the loops manually, as expression template approaches have the same
> performance as manual loops even with auto-vectorization. So now I'm not
> really sure what, if anything, needs to be done.
> 
>  FWIW I think lambdas _combined_ with a range library such as Boost.Range
> or https://github.com/ericniebler/range-v3 could be a reasonable approach,
> but if we already have an expression template library, I don't see any good
> reason not to use it.

It seems that we get something like C speed either way.
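
(For concreteness, here is a hedged sketch of the two styles under
discussion, with hypothetical names patterned after the unit tests'
mete_*() functions, and arbitrary array lengths; neither is the
actual lmi code:)

  #include <algorithm>
  #include <vector>

  std::vector<double> a0(100), a1(100), a2(100);

  // Classic explicit loop, in the style of mete_c().
  void mete_c_like()
  {
      for(std::vector<double>::size_type i = 0; i < a0.size(); ++i)
          a0[i] = a1[i] + a2[i];
  }

  // One way a lambda-based mete_cxx11_typical() might look.
  void mete_cxx11_like()
  {
      std::transform(a1.begin(), a1.end(), a2.begin(), a0.begin(),
          [](double x, double y) {return x + y;});
  }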

> GC> BTW, due to problem-domain considerations, typical lmi
> GC> vectors are of size fifty or so, and probably always in [10, 100].
> 
>  Thanks, this is by far the most important point in this discussion.
> Knowing this now, I wonder whether it really matters what we do here at all.

It matters greatly, because the goal is not simply to squeeze a
little more performance out of lmi, as discussed below.

> Unless we perform thousands of operations on arrays of this size, we
> could be shelling out to a Python interpreter to perform these operations,
> and the effect on performance would still be unnoticeable.

Illustration calculations are scalar in very large part, but there
is a lot of vector arithmetic, too; I'd take a casual guess that
there are on the order of a thousand vector-arithmetic operations
per illustration. Implementing any single operation inefficiently
wouldn't matter, but pessimizing them all would cost a great deal.

>  So, after spending quite some time on all this, I'm led to wondering
> what exactly we are trying to achieve here. Sorry for only asking this
> question now, but better late than never, I guess...

C++'s philosophy is that you shouldn't pay for what you don't use.
What I'm trying to achieve is the opposite: I want to write C++
with the expressive economy of APL, and have it run at the speed
of C. I.e., if capital letters symbolize vectors, I want to write
  A = B + ((3.0 - C) / (D + E)) - (F - G) * 0.5;
Terseness makes code easier to understand, and makes it easier to
spot errors. (That example looks more like FORTRAN than APL, but
the point is not operator-precedence-order vs. RPN; it's just to
write vector arithmetic as tidily as scalar arithmetic.)
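
(If the 'va' of the tables above denotes std::valarray, it already
allows exactly that line, with every operator applying elementwise;
a minimal sketch, with made-up lengths and contents:)

  #include <valarray>

  std::valarray<double> A(50), B(1.0, 50), C(2.0, 50), D(3.0, 50),
                        E(4.0, 50), F(5.0, 50), G(6.0, 50);

  void apl_style()
  {
      // Compiles as written; each operator applies elementwise.
      A = B + ((3.0 - C) / (D + E)) - (F - G) * 0.5;
  }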

These measurements allow us to narrow the set of techniques to be
considered, discarding anything that is noticeably slower,
especially around the N=50 "sweet spot". Thus, we won't shell out
to Python after all; that idea is off the table. But all the
other ideas have made the "first cut": they seem to have zero
cost vis-à-vis C and about the same run time as one another,
so we pick the one with the greatest expressive beauty. I've long
suspected that ET would be the winner because of my APL roots,
but if lambdas let us write something that looks more like LISP
but takes about the same number of characters, my mind is open.
Terse simplicity is much more important than any particular syntax.



