discuss-gnuradio
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Discuss-gnuradio] Request for comment: FFT optimizations


From: CEL
Subject: Re: [Discuss-gnuradio] Request for comment: FFT optimizations
Date: Thu, 6 Jun 2019 12:57:36 +0000

Hi Albin,

no, it's not been discussed on the list, as far as my memory reaches.
We're talking about gr:fft:fft_vcc, right?

Because that output memory shifting only happens when you set shift = 1
in the constructor. I don't have any hard feelings about whether to
shift the output or perform an exp(j π n) frequency shift (a.k.a.
multiplying with +1 -1 …). My gut feeling is that copying blocks of
memory is probably significantly faster than accessing all elements of
said blocks through the CPU sequentially. Memcpy implementations are
heavily hand-assembler optimized!

However, the actual FFT would access these elements most definitely,
so, for amounts of input samples smaller than the available CPU cache,
the +1 -1 variant would act like an elegant prefetcher and probably be
a bit more efficient, but for sizes larger, it would probably even
thrash the caches a bit and mean one additional loop over all data.
So, honestly, in many cases I'd expect a small improvement, in some a
performance degradation. Not quite sure whether looking into that is
clever. 

I'd however say that it should be obvious to do the +1 -1 input
multiplication instead of the memcpy output shift if a window is
applied, because the +1 -1 can be absorbed into the window in that
case. (Watch out for the odd-length corner case where you get two
different windows depending on whether you do an odd- or an even-
numbered transform.)

There's another easy way we can optimize things at nearly zero cost:

The memcpy to the FFTW input buffer only happens so that data is
properly aligned for FFTW to be able to use SIMD. Obviously, that copy
could be avoided alltogether if the input *is* already aligned.

FFTW does have a fftwf_alignment_of() that you can use to compare the
alignment of the especially aligned fftw plan input buffer, and the
current workload input pointer.
Same applies to the output buffer of the fftw plan and the output
pointer.

If the input buffer and the work input pointer, and the output buffer
and the work output pointer, are of the same aligment, one could simply
use fftwf_execute_dft(plan, *in, *out) instead of fftwf_execute(plan),
and not memcpy anything (unless the user requests shifting).

Chances are that, considering our circular buffers are page-aligned,
and that fft_length mod 4 = 0, for most applications,

fft_length * sizeof(gr_complex_float) = (fft_length/4) * 4 * 64 

are typically already 256-byte aligned, and things are "inherently
good". 


Cheers,
Marcus

On Thu, 2019-06-06 at 11:13 +0200, Albin Stigö wrote:
> Hi,
> 
> There's a lot of memcpy/memmove going on in the FFT kernel which I
> suspect degrades performance quite a bit... This is partly because
> some platforms requires buffers to be aligned for SIMD to work and
> partly because DC is not normally centered (which is normally what you
> would like to see in a spektrum).
> 
> For ARM in particular the alignment requirements are less strict so
> that would get rid of at least one memcpy for that platform.
> 
> For centering an FFT at DC there's a nice trick, you just rotate the
> input by pi, that is multiply by 1, -1, 1, -1... I suspect that would
> be quite a bit faster than two memcpy and should be available on all
> platforms.
> 
> Has this been discussed on the list before?
> 
> 
> --Albin
> 
> _______________________________________________
> Discuss-gnuradio mailing list
> address@hidden
> https://lists.gnu.org/mailman/listinfo/discuss-gnuradio

Attachment: smime.p7s
Description: S/MIME cryptographic signature


reply via email to

[Prev in Thread] Current Thread [Next in Thread]