Re: [Discuss-gnuradio] max and argmax blocks with SIMD instructions


From: Eric Blossom
Subject: Re: [Discuss-gnuradio] max and argmax blocks with SIMD instructions
Date: Mon, 23 Apr 2007 07:47:10 -0700
User-agent: Mutt/1.5.9i

On Mon, Apr 23, 2007 at 10:48:58AM +0200, Trond Danielsen wrote:
> Hi everyone,
> 
> I've written a couple of blocks for GNU Radio, but am not satisfied
> with the performance. I am therefore thinking of using SIMD
> instructions. However, I am not that familiar with x86 assembly
> instructions, and finding the reference manual on Intel's website was
> not easy. I know that DSPs such as the Blackfin have special vector
> instructions that would make this very simple, but I am not sure about
> x86.
> 
> I am also going to write a general purpose multiply and accumulate
> block that would benefit much from SIMD instructions.
> 
> Any comments are appreciated.
> 
> -- 
> Trond Danielsen

Hi Trond,

Can you point us at your code?  Before diving into SIMD, it would be
good to confirm that there isn't an easier change to make.  Have you
run oprofile on your code?

In general, when going for a speed-up, you want to package enough
cycles into the block for it to make a difference.  I.e., I'm not sure
that a general-purpose multiply-accumulate (MAC) block is going to
solve your problem.  However, if you take a look at the gr_fir_*.cc
code, you'll find that at the bottom each of them calls out to SIMD
assembler in {c,}complex_dotprod_*.S that implements the kernel of the
FIR filter.  In those cases the equivalent of the MAC function is buried
in an unrolled inner loop.
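
To make "MAC buried in an unrolled inner loop" concrete, here is a
minimal sketch in portable C.  It is not the GNU Radio assembler
kernel; the name dotprod_unrolled4 and the unroll factor of four are
assumptions chosen for illustration.  Four independent accumulators
give the compiler (or a hand-written SIMD version) room to overlap the
multiplies and adds.

```c
#include <stddef.h>

/* Hypothetical helper, not the GNU Radio kernel: dot product of n taps
 * with n input samples, with the multiply-accumulate unrolled by four. */
float dotprod_unrolled4(const float *taps, const float *input, size_t n)
{
    /* Four independent partial sums, one per unrolled lane. */
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    size_t i = 0;

    /* Main loop: four multiply-accumulates per iteration. */
    for (; i + 4 <= n; i += 4) {
        acc0 += taps[i]     * input[i];
        acc1 += taps[i + 1] * input[i + 1];
        acc2 += taps[i + 2] * input[i + 2];
        acc3 += taps[i + 3] * input[i + 3];
    }

    /* Combine the partial sums. */
    float sum = (acc0 + acc1) + (acc2 + acc3);

    /* Tail loop: handle the remaining 0..3 elements. */
    for (; i < n; i++)
        sum += taps[i] * input[i];

    return sum;
}
```

Note that the partial sums are added in a different order than a plain
loop would add them, so floating-point results can differ slightly
from the straightforward version.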

With SIMD programming, a lot of the complexity is figuring out how to
schedule the loads and stores, since unless you're careful, your
performance is dominated by the memory hierarchy and not the math.

Also, on the 32-bit x86 architecture, there are not enough registers
available to hide the load latencies (only eight XMM registers).  On
the x86-64 it's better, since you've got twice as many registers.  For
a comparison, on the Cell SPE you've got 128 (!) 128-bit registers.
No shortage of registers there ;)

In addition to the "IA-32 Architecture Software Developer's Manual" (I
suspect that was the one you had trouble finding), you'll want to look
for the microarchitecture-specific optimization manuals.  The one I've
got in front of me, "Intel Pentium 4 and Intel Xeon Processor
Optimization Reference Manual" (Order Number: 248966-04) isn't the
latest, but is an example.  I suspect that there's a new one out that
covers the Pentium M, Core, Core Duo, Core 2 Duo, etc.  AMD also has
similar manuals.  All these are on the vendor web sites, typically in
the "developer" section somewhere.

Be sure to create meaningful benchmarks to measure the performance of
your code.  That's a whole art unto itself.
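
As a starting point only, a timing harness might look like the sketch
below.  Everything in it is an assumption made for illustration: the
names kernel_fn, dotprod_reference and best_time_per_call, the use of
the standard C clock() function (CPU time), and the choice to report
the best of several trials.  A real benchmark would also need
realistic buffer sizes and data, since cache behavior matters a lot.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical signature for the kernel under test. */
typedef float (*kernel_fn)(const float *, const float *, size_t);

/* Plain reference kernel, used as something to time and compare against. */
float dotprod_reference(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Returns the best (smallest) CPU time per call in seconds over `trials`
 * runs of `reps` calls each, or -1.0 if trials or reps is not positive. */
double best_time_per_call(kernel_fn fn, const float *a, const float *b,
                          size_t n, int trials, int reps)
{
    /* Storing into a volatile makes it harder for the compiler to
     * throw the result away. */
    volatile float sink = 0.0f;
    double best = -1.0;

    if (reps <= 0)
        return -1.0;

    for (int t = 0; t < trials; t++) {
        clock_t start = clock();
        for (int r = 0; r < reps; r++)
            sink = fn(a, b, n);
        clock_t end = clock();

        double per_call = (double)(end - start) / CLOCKS_PER_SEC / reps;
        if (best < 0.0 || per_call < best)
            best = per_call;
    }

    (void)sink;
    return best;
}
```

Taking the minimum over trials is one way to reduce noise from other
activity on the machine; calling best_time_per_call once with the
reference kernel and once with the optimized one gives a first rough
comparison.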

When all is said and done, algorithmic changes often result in bigger
wins than SIMD assembler.  Be sure to look there first.  In our
case, the FFT based FIR code is faster than the hand-coded SIMD code
for pretty much all cases where ntaps >= 20.

Have fun!
Eric



