Re: [lmi] Using auto-vectorization


From: Vadim Zeitlin
Subject: Re: [lmi] Using auto-vectorization
Date: Mon, 23 Jan 2017 22:26:43 +0100

On Sat, 21 Jan 2017 11:21:12 +0000 Greg Chicares <address@hidden> wrote:

GC> On 2017-01-21 01:20, Vadim Zeitlin wrote:
GC> > On Fri, 20 Jan 2017 23:47:14 +0000 Greg Chicares <address@hidden> wrote:
GC> > 
GC> > GC> Would you like to propose a patch to 'expression_template_0_test.cpp'
GC> 
GC> I should also mention 'vector_test.cpp', which implements a tiny ET class
GC> and compares its performance to std::valarray and hand-written C.

 Thanks, I forgot about this one. Running it gives slightly different
results: with -O2 we get

        Time (seconds) for array0 = array1 + array2 by various methods
 length          C             et       et/C         va       va/C
      1      5.043e-08      4.850e-08  0.962      4.571e-08  0.906
     10      5.372e-08      5.187e-08  0.966      5.142e-08  0.957
     20      5.547e-08      5.448e-08  0.982      5.261e-08  0.949
     50      8.757e-08      7.518e-08  0.859      7.305e-08  0.834
    100      1.177e-07      1.058e-07  0.898      1.036e-07  0.880
   1000      7.356e-07      7.067e-07  0.961      6.884e-07  0.936
  10000      7.202e-06      7.016e-06  0.974      6.688e-06  0.929
 100000      8.970e-05      8.790e-05  0.980      8.588e-05  0.957

i.e. both "et" and "va" are faster than C, somehow. And with -O3 (which
includes auto-vectorization):

        Time (seconds) for array0 = array1 + array2 by various methods
 length          C             et       et/C         va       va/C
      1      5.218e-08      5.057e-08  0.969      4.714e-08  0.903
     10      5.515e-08      5.191e-08  0.941      4.702e-08  0.853
     20      5.105e-08      5.043e-08  0.988      4.645e-08  0.910
     50      6.275e-08      6.856e-08  1.093      5.653e-08  0.901
    100      7.027e-08      7.475e-08  1.064      7.224e-08  1.028
   1000      3.537e-07      3.505e-07  0.991      3.404e-07  0.962
  10000      4.932e-06      4.824e-06  0.978      4.878e-06  0.989
 100000      6.623e-05      6.552e-05  0.989      6.634e-05  1.002


The precision of the output doesn't reflect the actual confidence interval
at all: the numbers vary a lot between runs, and only the first digit after
the decimal point can be trusted for the absolute timings. But it still seems
safe to say that:

- Auto-vectorization brings noticeable benefits starting from N=50.

- Both "et" and "va" are roughly as fast as manual C code.


 Looking at the generated assembly additionally seems to show that:

- The generated code is not optimal because, if I understand it correctly,
  the compiler doesn't assume that the arrays are aligned on a 32-byte
  boundary. I thought using an explicit __attribute__((aligned(32))) would
  help with this, but somehow it doesn't change the generated code at all and
  I don't know why. I could only convince the compiler to generate code
  without misalignment fixups by explicitly using __builtin_assume_aligned(),
  which is ugly:
---------------------------------- >8 --------------------------------------
diff --git a/vector_test.cpp b/vector_test.cpp
index cc0ddcd..44d22fd 100644
--- a/vector_test.cpp
+++ b/vector_test.cpp
@@ -287,9 +287,12 @@ void demo1()

 void mete_c()
 {
+    double* const w = (double*)__builtin_assume_aligned (g_w.begin(), 32);
+    double* const u = (double*)__builtin_assume_aligned (g_u.begin(), 32);
+    double* const v = (double*)__builtin_assume_aligned (g_v.begin(), 32);
     for(int i = 0; i < g_array_length; ++i)
         {
-        g_w[i] = g_u[i] + g_v[i];
+        w[i] = u[i] + v[i];
         }
 }

---------------------------------- >8 --------------------------------------
  and it doesn't seem to change the results all that much (a sketch of a
  slightly less intrusive way of applying the same hint follows this list).

- The code of mete_{c,et,va} is not exactly identical: for some reason the
  three functions do the same thing but in different ways, i.e. using
  different instructions, and again I don't understand why. What I do see,
  however, is that the inner loop is exactly the same in all three cases and
  looks like this:

        .L:
                movupd  (%rbx,%rax), %xmm0
                addl    $1, %r8d
                addpd   (%r12,%rax), %xmm0
                movups  %xmm0, (%rdx,%rax)
                addq    $16, %rax
                cmpl    %r11d, %r8d
                jb      .L

  i.e. it operates on 128 bits (two doubles) at once. Enabling the use of AVX
  (present in all Intel CPUs since Sandy Bridge) replaces it with the
  following (don't pay attention to the register renaming, I don't think it's
  significant):

        .L:
                vmovapd (%r8,%r9), %ymm0
                addl    $1, %r10d
                vaddpd  (%rdi,%r9), %ymm0, %ymm0
                vmovapd %ymm0, (%rsi,%r9)
                addq    $32, %r9
                cmpl    %r10d, %edx
                ja      .L

  which now operates on 256 bits (four doubles) at once, and while it doesn't
  make the loop twice as fast, it still results in noticeable gains starting
  from N as small as 20 but is significantly slower for N >= 10000:

        Time (seconds) for array0 = array1 + array2 by various methods
 length          C             et       et/C         va       va/C
      1      5.399e-08      5.068e-08  0.939      4.935e-08  0.914
     10      4.972e-08      4.837e-08  0.973      5.022e-08  1.010
     20      4.504e-08      4.430e-08  0.984      4.949e-08  1.099
     50      4.780e-08      4.667e-08  0.976      5.324e-08  1.114
    100      5.332e-08      5.256e-08  0.986      6.280e-08  1.178
   1000      2.152e-07      2.076e-07  0.965      2.845e-07  1.322
  10000      5.658e-06      5.525e-06  0.976      5.509e-06  0.974
 100000      7.167e-05      6.636e-05  0.926      6.669e-05  0.930
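
 Regarding the alignment point above: the __builtin_assume_aligned() hint
doesn't have to touch the globals as in the patch; applied to the pointer
parameters of a standalone function it is a bit less intrusive. This is just
a sketch of the same idea (not code I'm proposing to add to vector_test.cpp),
and with it GCC should be able to drop the misalignment fixups in the same
way:

    // Sketch only: the same alignment hint as in the patch above, applied to
    // pointer parameters instead of to the global arrays.
    void add_assuming_aligned(double* w, double const* u, double const* v, int n)
    {
        w = static_cast<double*>(__builtin_assume_aligned(w, 32));
        u = static_cast<double const*>(__builtin_assume_aligned(u, 32));
        v = static_cast<double const*>(__builtin_assume_aligned(v, 32));
        for(int i = 0; i < n; ++i)
            {
            w[i] = u[i] + v[i];
            }
    }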


 I am very puzzled by the results of these benchmarks and I wonder whether
there is a problem in the benchmarking methodology itself, because I just
don't see how to explain this. I am going to try doing two things in order
to clear it up, unless you think it's a needless waste of time: first, I'd
like to change TimeAnAliquot() to also calculate the standard deviation of
the times, and maybe use it to run the test more times in order to lower it.
And second, I'm going to rerun these tests on another machine, to check
whether the results are reproducible.
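
 To be concrete, what I mean by the first change is something along these
lines (purely an illustrative sketch with made-up names, not the existing
TimeAnAliquot() code): keep the individual trial timings and report their
spread next to the mean, so the tables above could carry an error estimate.

    // Illustrative sketch only, not lmi's TimeAnAliquot(); assumes at least
    // two trials. Reports the mean and the sample standard deviation of the
    // raw per-trial timings.
    #include <cmath>
    #include <numeric>
    #include <vector>

    struct timing_summary
    {
        double mean;
        double stddev;
    };

    timing_summary summarize(std::vector<double> const& times)
    {
        double const n    = static_cast<double>(times.size());
        double const mean = std::accumulate(times.begin(), times.end(), 0.0) / n;
        double sum_sq_dev = 0.0;
        for(double t : times)
            {
            sum_sq_dev += (t - mean) * (t - mean);
            }
        return {mean, std::sqrt(sum_sq_dev / (n - 1.0))};
    }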


GC> > GC> so that we can measure what you'd prefer
GC> 
GC> I'm guessing that you'll prefer lambdas. If you like, you could just give
GC> me your implementation parallel to mete_c() in each of those unit tests,
GC> and I'll add it. Or would your preferred implementation simply replace
GC> the classic for-loops with ranged ones? and, if so, could you show me how
GC> you'd write mete_cxx11_typical() to parallel the other mete_*_typical(),
GC> if that's a better test of what you have in mind?

 The test results seem to invalidate what I had in mind, namely just writing
the loops manually, as the expression template approaches have the same
performance as manual loops even with auto-vectorization. So now I'm not
really sure what, if anything, needs to be done.

 FWIW I think lambdas _combined_ with a range library such as Boost.Range
or https://github.com/ericniebler/range-v3 could be a reasonable approach,
but since we already have an expression template library, I don't see any
good reason not to use it.
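
 Just to illustrate what I meant, a C++11-only version without any range
library might look like the sketch below (the name and signature are made
up, this is not an existing mete_*() function); with Boost.Range or range-v3
the explicit iterator pairs would collapse into single range arguments.

    // Illustrative only: a lambda-based counterpart to mete_c(), assuming the
    // arrays are exposed as std::vector<double>; w must already have the same
    // size as u and v.
    #include <algorithm>
    #include <vector>

    void mete_cxx11
        (std::vector<double> const& u
        ,std::vector<double> const& v
        ,std::vector<double>      & w
        )
    {
        std::transform
            (u.begin(), u.end()
            ,v.begin()
            ,w.begin()
            ,[](double a, double b) {return a + b;}
            );
    }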


GC> BTW, due to problem-domain considerations, typical lmi
GC> vectors are of size fifty or so, and probably always in [10, 100].

 Thanks, this is by far the most important point in this discussion.
Knowing this, I now wonder whether it really matters what we do here at all.
Unless we perform thousands of operations on arrays of this size, we could
be shelling out to a Python interpreter to perform these operations and the
effect on performance would still be unnoticeable.

 So, after spending quite some time on all this, I'm led to wondering what
exactly we are trying to achieve here. Sorry for only asking this question
now, but better late than never, I guess...

VZ

