Re: [Discuss-gnuradio] Try to improve E100's performance at high sample rate

On Tue, Jan 17, 2012 at 10:36 AM, Josh Blum <address@hidden> wrote:

On 01/16/2012 09:51 AM, ziyang wrote:
> On 01/13/2012 09:30 PM, Josh Blum wrote:
>>> To reduce the computation load of the processor, I tried two methods:
>>> 1) modify the gr.quadrature_demod_cf block, replace some multiplication
>>> operations with volk-based operations (gr.multiply and gr.multiply_const
>>> modules in gr_blocks);
>> I like it. Make sure to contribute patches like that back. :-)
> Actually, what I did was writing a new quadrature_demod block without
> the multiplication and delay operations, and connect extra gr.multiply
> and gr.delay blocks instead in the flow graph. Because my understanding
> is that the volk functions take a vector (multiple values) as input, and
> I didn't figure out a way to do the single-item-operation in the volk
> style.
>

I don't recommend using the extra blocks; that would probably cause more
overhead. Looking at gr_quadrature_demod_cf::work, it looks like you can
vectorize the conjugate multiply, then the atan, then the gain scaling.
So, that would be one for loop that operates on 4 samples at a time and
calls 3 volk functions.
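
A minimal sketch of that loop structure, in plain Python rather than Volk, may help make the three stages concrete. The function names (conj_multiply, phase, scale, quadrature_demod) and the chunk parameter are hypothetical and only illustrate the shape of the loop; they are not Volk or GNU Radio calls, and pure Python gives none of the SIMD speedup.

```python
import cmath
import math

def conj_multiply(a, b):
    """Stage 1: element-wise a[i] * conj(b[i])."""
    return [x * y.conjugate() for x, y in zip(a, b)]

def phase(v):
    """Stage 2: element-wise atan2(imag, real)."""
    return [math.atan2(z.imag, z.real) for z in v]

def scale(v, gain):
    """Stage 3: element-wise multiply by a constant gain."""
    return [gain * x for x in v]

def quadrature_demod(samples, gain, chunk=4):
    """Demodulate in chunks; samples[0] serves only as history,
    so the output has len(samples) - 1 items."""
    out = []
    for start in range(1, len(samples), chunk):
        stop = min(start + chunk, len(samples))
        cur = samples[start:stop]            # x[n]
        prev = samples[start - 1:stop - 1]   # x[n-1]
        out.extend(scale(phase(conj_multiply(cur, prev)), gain))
    return out

# A tone whose phase advances 0.25 rad per sample demodulates to a
# constant gain * 0.25.
step = 0.25
tone = [cmath.exp(1j * step * n) for n in range(10)]
demod = quadrature_demod(tone, gain=2.0)
```

In a real block each stage would be a single vector call over the chunk, and the chunk size would follow the SIMD width and alignment rather than being a free parameter.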

Right now, the Volk atan2 function is only implemented for SSE and only works if libsimdmath is installed. If not, it will fall back to a generic implementation which is considerably slower than Gnuradio's LUT atan2. There's no NEON implementation, so right now the fastest option on E100 is to use Gnuradio's built-in atan2.

I spent some quality time a couple of months ago during SDR Forum writing a vectorized atan2 algorithm in Volk via Orc. I was unable to get the entire algorithm to fit within the register constraints the Orc runtime compiler applies. The end goal is to get the entire algorithm vectorized so it only needs to write out to memory once, which is going to be far faster than running three vector operations across a large buffer that won't fit into cache. I'll get back to it one of these days, but it looks like parts of Orc's compiler will have to be improved. Terry, Orc code is easily read and looks like vector pseudocode, so my Orc implementation might be of use if you're interested in writing a custom NEON implementation for Volk. It's based on the libsimdmath implementation, which is in turn based on Cephes, and uses all sorts of Crazy Math Tricks.
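
To illustrate the difference between three passes over a buffer and a single fused pass, here is a rough sketch in plain Python. Both function names are hypothetical, and this shows only the memory-access pattern, not the Orc or libsimdmath algorithm: the three-pass version writes two intermediate buffers, while the fused version writes each output item once.

```python
import cmath
import math

def demod_three_pass(samples, gain):
    """Three separate passes, each writing a full intermediate buffer."""
    n = len(samples) - 1
    prod = [samples[i + 1] * samples[i].conjugate() for i in range(n)]
    ang = [math.atan2(p.imag, p.real) for p in prod]
    return [gain * a for a in ang]

def demod_fused(samples, gain):
    """One pass: multiply, atan2 and gain per item, one write per output."""
    out = [0.0] * (len(samples) - 1)
    for i in range(len(out)):
        p = samples[i + 1] * samples[i].conjugate()
        out[i] = gain * math.atan2(p.imag, p.real)
    return out

# Unit-magnitude test signal with a slowly increasing phase step.
chirp = [cmath.rect(1.0, 0.01 * n * n) for n in range(32)]
three = demod_three_pass(chirp, 1.5)
fused = demod_fused(chirp, 1.5)
```

Both functions compute the same values; the fused form is the one that avoids streaming a large buffer through the cache three times.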

--n

>> Also, you may consider timing a particular operation as a performance
>> metric, rather than counting the number of demodulated packets.
>>
> I was wondering if there are examples from which I can learn how to do
> this?

Sorry, I guess there isn't much in the way of examples.

You can time individual work functions by adding some code before and
after. We have some high resolution timers in
gruel/include/gruel/high_res_timers.h.
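
The before-and-after pattern can be sketched in Python as follows. This uses the standard library's time.perf_counter_ns as a stand-in for the gruel timers, and WorkTimer and work are hypothetical names for illustration; in a C++ block the same bookkeeping would go at the top and bottom of the work function.

```python
import time

class WorkTimer:
    """Accumulates wall-clock time spent inside a wrapped function."""

    def __init__(self, func):
        self.func = func
        self.calls = 0
        self.total_ns = 0

    def __call__(self, *args, **kwargs):
        start = time.perf_counter_ns()       # timestamp before the work
        try:
            return self.func(*args, **kwargs)
        finally:
            # timestamp after the work; accumulate across calls
            self.total_ns += time.perf_counter_ns() - start
            self.calls += 1

def work(items):
    # Hypothetical stand-in for a block's work function.
    return [2.0 * x for x in items]

timed_work = WorkTimer(work)
for _ in range(100):
    result = timed_work([1.0, 2.0, 3.0, 4.0])

mean_ns = timed_work.total_ns / timed_work.calls
```

Accumulating over many calls matters because a single call on a small buffer is often shorter than the timer's useful resolution.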

I have also seen people time the block in a simple flow graph with a
null source, head, your_block, null_sink. You can time tb.run() and
compare run duration vs the non-vectorized code.
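
A small harness for that kind of comparison might look like the sketch below. It is plain Python with hypothetical names (time_run, baseline, candidate); in a GNU Radio script the callable passed to time_run would be the top block's run method, with a head block fixing the number of items so that both runs process the same amount of data.

```python
import time

def time_run(run, repeats=5):
    """Return the shortest wall-clock duration, in seconds, of run()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        best = min(best, time.perf_counter() - start)
    return best

# Fixed amount of work, playing the role of the head block.
N = 20000
data = [float(i) for i in range(N)]

def baseline():
    # Stand-in for the original, non-vectorized code path.
    out = []
    for x in data:
        out.append(0.5 * x)
    return out

def candidate():
    # Stand-in for the modified code path under test.
    return [0.5 * x for x in data]

t_base = time_run(baseline)
t_cand = time_run(candidate)
print("baseline %.6f s, candidate %.6f s" % (t_base, t_cand))
```

Taking the best of several repeats reduces noise from scheduling and cache warm-up, which can be significant on a small embedded processor.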

-Josh


From:	Nick Foster
Subject:	Re: [Discuss-gnuradio] Try to improve E100's performance at high sample rate
Date:	Tue, 17 Jan 2012 10:54:47 -0800