discuss-gnuradio
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Discuss-gnuradio] Request for comment: FFT optimizations


From: CEL
Subject: Re: [Discuss-gnuradio] Request for comment: FFT optimizations
Date: Fri, 7 Jun 2019 08:12:06 +0000

Hi Johannes,

always remember that the "f" suffix after fftw is just for "float",
without that suffix you use the double precision FFTw.

> fftwf_execute_dft(plan, *in, *out) 

http://www.fftw.org/fftw3_doc/New_002darray-Execute-Functions.html

> fftwf_alignment_of()

same page, and

http://www.fftw.org/fftw3_doc/Planner-Flags.html#index-fftw_005falignment_005fof

under "FFTW_UNALIGNED".

also, and this might be a bit of a downer: use the source, Luke!

(also, this might even be more of a downer:)
fftw3/kernel/align.c, reduced by all false #if:

#define ALGN 16
int X(ialignment_of)(R *p)
{
     return (int)(((uintptr_t) p) % ALGN);
}

or in other words:

address % 16

However, we should really be using the fftw function for that, in case
they get around to writing say AVX7 FFT kernels, which need 2048-byte
aligned memory instructions or so¹.

Best regards,
Marcus

¹if you're from the year 2028, and are reading up on intels new AVX7
instruction on the GNU Radio mailing list archive: this is a joke.
Obviously, AVX7 only needs 1024-byte alignment and that you only use
right-hand quantum registers.

On Fri, 2019-06-07 at 07:51 +0000, Johannes Demel wrote:
> Hi Marcus,
> 
> do you have any online resources that point to the official 
> documentation of `fftwf_execute_dft(plan, *in, *out)` and 
> `fftwf_alignment_of()`? (for the latter this mail thread is on the
> first 
> page of my Google results.)
> All the documentation I could find is [0] and this seems like some
> Intel 
> wrapper that is not part of FFTW?!
> 
> Cheers
> Johannes
> 
> [0] 
> http://sep.stanford.edu/sep/claudio/Research/Prst_ExpRefl/ShtPSPI/intel/mkl/10.0.3.020/interfaces/fftw3xf/wrappers/fftwf_execute_dft.c
> 
> Am 06.06.19 um 14:57 schrieb Müller, Marcus (CEL):
> > Hi Albin,
> > 
> > no, it's not been discussed on the list, as far as my memory
> > reaches.
> > We're talking about gr:fft:fft_vcc, right?
> > 
> > Because that output memory shifting only happens when you set shift
> > = 1
> > in the constructor. I don't have any hard feelings about whether to
> > shift the output or perform an exp(j π n) frequency shift (a.k.a.
> > multiplying with +1 -1 …). My gut feeling is that copying blocks of
> > memory is probably significantly faster than accessing all elements
> > of
> > said blocks through the CPU sequentially. Memcpy implementations
> > are
> > heavily hand-assembler optimized!
> > 
> > However, the actual FFT would access these elements most
> > definitely,
> > so, for amounts of input samples smaller than the available CPU
> > cache,
> > the +1 -1 variant would act like an elegant prefetcher and probably
> > be
> > a bit more efficient, but for sizes larger, it would probably even
> > thrash the caches a bit and mean one additional loop over all data.
> > So, honestly, in many cases I'd expect a small improvement, in some
> > a
> > performance degradation. Not quite sure whether looking into that
> > is
> > clever.
> > 
> > I'd however say that it should be obvious to do the +1 -1 input
> > multiplication instead of the memcpy output shift if a window is
> > applied, because the +1 -1 can be absorbed into the window in that
> > case. (Watch out for the odd-length corner case where you get two
> > different windows depending on whether you do an odd- or an even-
> > numbered transform.)
> > 
> > There's another easy way we can optimize things at nearly zero
> > cost:
> > 
> > The memcpy to the FFTW input buffer only happens so that data is
> > properly aligned for FFTW to be able to use SIMD. Obviously, that
> > copy
> > could be avoided alltogether if the input *is* already aligned.
> > 
> > FFTW does have a fftwf_alignment_of() that you can use to compare
> > the
> > alignment of the especially aligned fftw plan input buffer, and the
> > current workload input pointer.
> > Same applies to the output buffer of the fftw plan and the output
> > pointer.
> > 
> > If the input buffer and the work input pointer, and the output
> > buffer
> > and the work output pointer, are of the same aligment, one could
> > simply
> > use fftwf_execute_dft(plan, *in, *out) instead of
> > fftwf_execute(plan),
> > and not memcpy anything (unless the user requests shifting).
> > 
> > Chances are that, considering our circular buffers are page-
> > aligned,
> > and that fft_length mod 4 = 0, for most applications,
> > 
> > fft_length * sizeof(gr_complex_float) = (fft_length/4) * 4 * 64
> > 
> > are typically already 256-byte aligned, and things are "inherently
> > good".
> > 
> > 
> > Cheers,
> > Marcus
> > 
> > On Thu, 2019-06-06 at 11:13 +0200, Albin Stigö wrote:
> > > Hi,
> > > 
> > > There's a lot of memcpy/memmove going on in the FFT kernel which
> > > I
> > > suspect degrades performance quite a bit... This is partly
> > > because
> > > some platforms requires buffers to be aligned for SIMD to work
> > > and
> > > partly because DC is not normally centered (which is normally
> > > what you
> > > would like to see in a spektrum).
> > > 
> > > For ARM in particular the alignment requirements are less strict
> > > so
> > > that would get rid of at least one memcpy for that platform.
> > > 
> > > For centering an FFT at DC there's a nice trick, you just rotate
> > > the
> > > input by pi, that is multiply by 1, -1, 1, -1... I suspect that
> > > would
> > > be quite a bit faster than two memcpy and should be available on
> > > all
> > > platforms.
> > > 
> > > Has this been discussed on the list before?
> > > 
> > > 
> > > --Albin
> > > 
> > > _______________________________________________
> > > Discuss-gnuradio mailing list
> > > address@hidden
> > > https://lists.gnu.org/mailman/listinfo/discuss-gnuradio
> > > 
> > > _______________________________________________
> > > Discuss-gnuradio mailing list
> > > address@hidden
> > > https://lists.gnu.org/mailman/listinfo/discuss-gnuradio

Attachment: smime.p7s
Description: S/MIME cryptographic signature


reply via email to

[Prev in Thread] Current Thread [Next in Thread]