[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Discuss-gnuradio] Request for comment: FFT optimizations
From: |
CEL |
Subject: |
Re: [Discuss-gnuradio] Request for comment: FFT optimizations |
Date: |
Fri, 7 Jun 2019 08:12:06 +0000 |
Hi Johannes,
always remember that the "f" suffix after fftw is just for "float",
without that suffix you use the double precision FFTw.
> fftwf_execute_dft(plan, *in, *out)
http://www.fftw.org/fftw3_doc/New_002darray-Execute-Functions.html
> fftwf_alignment_of()
same page, and
http://www.fftw.org/fftw3_doc/Planner-Flags.html#index-fftw_005falignment_005fof
under "FFTW_UNALIGNED".
also, and this might be a bit of a downer: use the source, Luke!
(also, this might even be more of a downer:)
fftw3/kernel/align.c, reduced by all false #if:
#define ALGN 16
int X(ialignment_of)(R *p)
{
return (int)(((uintptr_t) p) % ALGN);
}
or in other words:
address % 16
However, we should really be using the fftw function for that, in case
they get around to writing say AVX7 FFT kernels, which need 2048-byte
aligned memory instructions or so¹.
Best regards,
Marcus
¹if you're from the year 2028, and are reading up on intels new AVX7
instruction on the GNU Radio mailing list archive: this is a joke.
Obviously, AVX7 only needs 1024-byte alignment and that you only use
right-hand quantum registers.
On Fri, 2019-06-07 at 07:51 +0000, Johannes Demel wrote:
> Hi Marcus,
>
> do you have any online resources that point to the official
> documentation of `fftwf_execute_dft(plan, *in, *out)` and
> `fftwf_alignment_of()`? (for the latter this mail thread is on the
> first
> page of my Google results.)
> All the documentation I could find is [0] and this seems like some
> Intel
> wrapper that is not part of FFTW?!
>
> Cheers
> Johannes
>
> [0]
> http://sep.stanford.edu/sep/claudio/Research/Prst_ExpRefl/ShtPSPI/intel/mkl/10.0.3.020/interfaces/fftw3xf/wrappers/fftwf_execute_dft.c
>
> Am 06.06.19 um 14:57 schrieb Müller, Marcus (CEL):
> > Hi Albin,
> >
> > no, it's not been discussed on the list, as far as my memory
> > reaches.
> > We're talking about gr:fft:fft_vcc, right?
> >
> > Because that output memory shifting only happens when you set shift
> > = 1
> > in the constructor. I don't have any hard feelings about whether to
> > shift the output or perform an exp(j π n) frequency shift (a.k.a.
> > multiplying with +1 -1 …). My gut feeling is that copying blocks of
> > memory is probably significantly faster than accessing all elements
> > of
> > said blocks through the CPU sequentially. Memcpy implementations
> > are
> > heavily hand-assembler optimized!
> >
> > However, the actual FFT would access these elements most
> > definitely,
> > so, for amounts of input samples smaller than the available CPU
> > cache,
> > the +1 -1 variant would act like an elegant prefetcher and probably
> > be
> > a bit more efficient, but for sizes larger, it would probably even
> > thrash the caches a bit and mean one additional loop over all data.
> > So, honestly, in many cases I'd expect a small improvement, in some
> > a
> > performance degradation. Not quite sure whether looking into that
> > is
> > clever.
> >
> > I'd however say that it should be obvious to do the +1 -1 input
> > multiplication instead of the memcpy output shift if a window is
> > applied, because the +1 -1 can be absorbed into the window in that
> > case. (Watch out for the odd-length corner case where you get two
> > different windows depending on whether you do an odd- or an even-
> > numbered transform.)
> >
> > There's another easy way we can optimize things at nearly zero
> > cost:
> >
> > The memcpy to the FFTW input buffer only happens so that data is
> > properly aligned for FFTW to be able to use SIMD. Obviously, that
> > copy
> > could be avoided alltogether if the input *is* already aligned.
> >
> > FFTW does have a fftwf_alignment_of() that you can use to compare
> > the
> > alignment of the especially aligned fftw plan input buffer, and the
> > current workload input pointer.
> > Same applies to the output buffer of the fftw plan and the output
> > pointer.
> >
> > If the input buffer and the work input pointer, and the output
> > buffer
> > and the work output pointer, are of the same aligment, one could
> > simply
> > use fftwf_execute_dft(plan, *in, *out) instead of
> > fftwf_execute(plan),
> > and not memcpy anything (unless the user requests shifting).
> >
> > Chances are that, considering our circular buffers are page-
> > aligned,
> > and that fft_length mod 4 = 0, for most applications,
> >
> > fft_length * sizeof(gr_complex_float) = (fft_length/4) * 4 * 64
> >
> > are typically already 256-byte aligned, and things are "inherently
> > good".
> >
> >
> > Cheers,
> > Marcus
> >
> > On Thu, 2019-06-06 at 11:13 +0200, Albin Stigö wrote:
> > > Hi,
> > >
> > > There's a lot of memcpy/memmove going on in the FFT kernel which
> > > I
> > > suspect degrades performance quite a bit... This is partly
> > > because
> > > some platforms requires buffers to be aligned for SIMD to work
> > > and
> > > partly because DC is not normally centered (which is normally
> > > what you
> > > would like to see in a spektrum).
> > >
> > > For ARM in particular the alignment requirements are less strict
> > > so
> > > that would get rid of at least one memcpy for that platform.
> > >
> > > For centering an FFT at DC there's a nice trick, you just rotate
> > > the
> > > input by pi, that is multiply by 1, -1, 1, -1... I suspect that
> > > would
> > > be quite a bit faster than two memcpy and should be available on
> > > all
> > > platforms.
> > >
> > > Has this been discussed on the list before?
> > >
> > >
> > > --Albin
> > >
> > > _______________________________________________
> > > Discuss-gnuradio mailing list
> > > address@hidden
> > > https://lists.gnu.org/mailman/listinfo/discuss-gnuradio
> > >
> > > _______________________________________________
> > > Discuss-gnuradio mailing list
> > > address@hidden
> > > https://lists.gnu.org/mailman/listinfo/discuss-gnuradio
smime.p7s
Description: S/MIME cryptographic signature