
Re: [lmi] Using auto-vectorization


From: Vadim Zeitlin
Subject: Re: [lmi] Using auto-vectorization
Date: Tue, 24 Jan 2017 03:49:33 +0100

On Tue, 24 Jan 2017 02:26:12 +0000 Greg Chicares <address@hidden> wrote:

GC> On 2017-01-23 21:26, Vadim Zeitlin wrote:
GC> [...with '-O3'...]
GC> >         Time (seconds) for array0 = array1 + array2 by various methods
GC> >  length          C             et       et/C         va       va/c
GC> >       1      5.218e-08      5.057e-08  0.969      4.714e-08  0.903
GC> >      10      5.515e-08      5.191e-08  0.941      4.702e-08  0.853
GC> >      20      5.105e-08      5.043e-08  0.988      4.645e-08  0.910
GC> >      50      6.275e-08      6.856e-08  1.093      5.653e-08  0.901
GC> >     100      7.027e-08      7.475e-08  1.064      7.224e-08  1.028
GC> >    1000      3.537e-07      3.505e-07  0.991      3.404e-07  0.962
GC> >   10000      4.932e-06      4.824e-06  0.978      4.878e-06  0.989
GC> >  100000      6.623e-05      6.552e-05  0.989      6.634e-05  1.002
GC> [...]
GC> >   i.e. it operates on 64 bits at once. Enabling the use of AVX (present in
GC> >   all Intel CPUs since Sandy Bridge) replaces it with the following
GC> [...]
GC> >   which now operates on 128 bits at once and while it doesn't make
GC> >   the loop twice as fast, it still results in noticeable gains
GC> >   starting from N as small as 20 but is significantly slower for
GC> >   N > 10000:
GC> > 
GC> >         Time (seconds) for array0 = array1 + array2 by various methods
GC> >  length          C             et       et/C         va       va/c
GC> >       1      5.399e-08      5.068e-08  0.939      4.935e-08  0.914
GC> >      10      4.972e-08      4.837e-08  0.973      5.022e-08  1.010
GC> >      20      4.504e-08      4.430e-08  0.984      4.949e-08  1.099
GC> >      50      4.780e-08      4.667e-08  0.976      5.324e-08  1.114
GC> >     100      5.332e-08      5.256e-08  0.986      6.280e-08  1.178
GC> >    1000      2.152e-07      2.076e-07  0.965      2.845e-07  1.322
GC> >   10000      5.658e-06      5.525e-06  0.976      5.509e-06  0.974
GC> >  100000      7.167e-05      6.636e-05  0.926      6.669e-05  0.930
GC> 
GC> Just to make absolutely sure I understand this, I think you're comparing
GC> the two preceding tables, e.g.:
GC> 
GC> -O3:  100      7.027e-08      7.475e-08  1.064      7.224e-08  1.028
GC> AVX:  100      5.332e-08      5.256e-08  0.986      6.280e-08  1.178
GC>                faster         faster                faster
GC> 
GC> -O3 10000      4.932e-06      4.824e-06  0.978      4.878e-06  0.989
GC> AVX 10000      5.658e-06      5.525e-06  0.976      5.509e-06  0.974
GC>                slower         slower                slower

 Yes, that's what I was doing, and it is indeed what puzzled me.

GC> In this 'vector_test', maybe the size of the CPU cache matters. N is
GC> not just the number of times we go through a loop: it's the length of
GC> the data structure.

 Yes, but why would it be affected by the instructions used to iterate in
the loop?

GC> >  I am very puzzled by the results of these benchmarks and I wonder if
GC> > there is no problem in the benchmarking methodology itself because I
GC> > just don't see how to explain this. I am going to try doing 2 things
GC> > in order to clear it up, unless you think it's a needless waste of
GC> > time: first, I'd like to change TimeAnAliquot() to also calculate the
GC> > standard deviation of the times and maybe use it to run the test more
GC> > times in order to lower it. And second, I'm going to rerun these tests
GC> > on another machine, to check if the results are reproducible.
GC> 
GC> "first": TimeAnAliquot() returns the mean, and whenever I study its
GC> results deeply I wish it returned a more robust statistic such as
GC> the median or mode, but those would require storing the time spent
GC> on each iteration,

 For the standard deviation it wouldn't: I'm almost sure that I've already
posted the code I use for computing it on the fly, implementing the
algorithm given in section 4.2.2 of "TAoCP, Volume 2: Seminumerical
Algorithms".

GC> and the resulting overhead might perturb timings of very simple
GC> operations like the N=1 tests here; however, those reservations make
GC> less sense now that we have excess RAM than they did at the beginning
GC> of the lmi epoch. See:
GC> 
GC> http://stackoverflow.com/questions/10930732/c-efficiently-calculating-a-running-median

 This also looks interesting, thanks.

GC> These measurements allow us to narrow the set of techniques to be
GC> considered, discarding anything that is noticeably slower,
GC> especially around the N=50 "sweet spot". Thus, we won't shell out
GC> to python after all; that idea is off the table. But all the
GC> other ideas have made the "first cut"; they seem to have zero
GC> cost vis-à-vis C, and all have about the same cost in run time,
GC> so we pick the one with greatest expressive beauty. I've long
GC> suspected that ET would be the winner because of my APL roots,
GC> but if lambdas let us write something that looks more like LISP
GC> but takes about the same number of characters, my mind is open.
GC> Terse simplicity is much more important than a particular syntax.

 I don't see how lambdas can beat a dedicated ET-based class that provides
the natural syntax for all the operations. We could implement something
less horrible than the code in mete_stl_fancy() in the ET test, but
lambdas would still carry some unavoidable syntactic overhead.

 Again, I thought that code using ET might not be auto-vectorizable, while
I knew that manually written loops certainly were (although sometimes some
extra effort is required even for those, as my struggles with alignment
confirmed), and I strongly suspected that code using lambdas would be
auto-vectorizable too, because their code can be fully inlined, giving the
compiler full visibility and allowing it to optimize them just as well as
manually written loops. But I was wrong: ET-based code seems to profit
from auto-vectorization just as well as everything else, so I don't see
any reason to use anything else, especially if code clarity and simplicity
are the most important criteria.

 Now, whether the particular PETE library is still the best choice in 2017
is another question, and I suspect that it isn't, but I'm not aware of any
critical problems with it either.

 So I guess I'm still not sure what, if anything, should be done here. I
can spend a lot of time profiling/benchmarking/debugging, and that would
probably yield at least some useful insights, but I can't propose any
syntax better than the current ET-based one, and so I'm still not sure
what my goal here is.

 Regards,
VZ

