discuss-gnuradio
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Discuss-gnuradio] VOLK division between complexes


From: Marcus Müller
Subject: Re: [Discuss-gnuradio] VOLK division between complexes
Date: Sat, 14 May 2016 19:47:39 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.6.0

Hi Federico,

could you have a look at this branch [1] and tell me whether it works for you? It's SSE3 and AVX only, so far.
Basically, you could

cd /path/to/source/gnuradio/volk
git pull https://github.com/marcusmueller/volk complex_division
cd ../build
make
make install

and use volk_32fc_x2_divide_32fc in your code.

Best regards,
Marcus

[1] https://github.com/marcusmueller/volk/tree/complex_division

On 11.05.2016 23:56, Federico Larroca wrote:
Thank you very much for your quick answers!
Marcus (Leech), I found the function you mentioned minutes after I sent the mail. Although it apparently works, Performance Monitor is behaving really weird when I use it. I have to look up that.
Marcus (Müller), a very informative answer indeed. I will see if I can get that endless fame you mention :-).
In any case, I'll post what I finally did and the performance gain achieved.
Best
Federico


2016-05-11 17:47 GMT-03:00 Marcus Müller <address@hidden>:
Hi Federico,


On 11.05.2016 21:09, Federico Larroca wrote:
Hello everyone,
We are on the stage of optimizing our project (gr-isdbt).
Awesome!
One of the most consuming blocks is OFDM synchronization, and in particular the equalization phase. This is simply the division between the input signal and the estimated channel gains (two modestly big arrays of ~5000 complexes for each OFDM symbol).
Until now, this was performed by a for loop, so my plan was to change it for a volk function. However, there is no complex division in VOLK. So I've done a rather indirect operation using the property that a/b = a*conj(b)/|b|^2, resulting in six lines of code (a multiply conjugate, a magnitude squared, a deinterleave, a couple of float divisions and an interleave). Obviously the performance gain (measured with the Performance Monitor) is marginal (to be optimistic)...
I have to admit, I'd expect your "simple" for loop doing something like

void yourclass::normalize(std::complex<float> *a, std::complex<float> *b) {
    for(size_t idx; idx < a_len; ++idx)
       a[idx] /= b[idx];
}

to be neatly optimizable by the compiler, at least if it knows that a and b aren't pointing at the same memory-

Your approach,
$\frac
                ab = a \cdot \frac{b^*}{|b|^2}= a \cdot
                \frac{b^*}{b\,b^*} = a \cdot \frac 1b$
is correct; however, in C++ with std::complex<>
a/b
pretty much does that already (ugly std lib C++ ahead, from /usr/include/c++/<version>/complex):
  // XXX: This is a grammar school implementation.
  template<typename _Tp>
    template<typename _Up>
    complex<_Tp>&
    complex<_Tp>::operator/=(const complex<_Up>& __z)
    {
      const _Tp __r =  _M_real * __z.real() + _M_imag * __z.imag();
      const _Tp __n = std::norm(__z);
      _M_imag = (_M_imag * __z.real() - _M_real * __z.imag()) / __n;
      _M_real = __r / __n;
      return *this;
    }
And the problem is that while doing that for every a and b separately might mean you can't make full use of SIMD instructions to eg. do four complex divisions at once, it avoids having to load and store original / intermediate values from/to RAM. Basically, your CPU might not be the bottleneck – RAM could be, and doing everything you need for a single division at once, even if done without any optimization, might be faster than incurring additional memory transfers. That's because your memory controller pre-fetches whole cache lines worth of values when getting the first elements of a and b, and working on values from cache is significantly (read: factor > 50) than a single memory transfer.

So, my immediate recommendation really is to keep your loop as minimal as possible, giving your compiler a solid chance to see the potential for optimization. There might not be much you can do. Even hand-written VOLK kernels aren't always faster than automatically generated optimized machine code.
Does anyone has a better idea? Implementing a new kernel is simply out of my knowledge scope.
Ha! But it would mean endless (additional) fame!
Soooo: look at the volk_32fc_x2_multiply_conjugate_32fc.h kernel source. Specifically, at the SSE3 implementation, volk_32fc_x2_multiply_conjugate_32fc_u_sse3(…).
You'll notice line 134:
     z = _mm_complexconjugatemul_ps(x, y);

As you can see, there's a a "VOLK intrinsic",
_mm_complexconjugatemul_ps
which is defined in volk_intrinsics.h. That same file contains _mm_magnitudesquared_ps_sse3 . Maybe you can make something clever out of that :)

Best regards,
Marcus


[1] https://gcc.gnu.org/onlinedocs/gcc/Restricted-Pointers.html

_______________________________________________
Discuss-gnuradio mailing list
address@hidden
https://lists.gnu.org/mailman/listinfo/discuss-gnuradio




reply via email to

[Prev in Thread] Current Thread [Next in Thread]