Is there any way to compute 1D FFT of 2D FFT in another dimension without transposition using intel mkl?

Question

I want to use the mkl to compute 1D FFT of a 2D array stored as a 1D array. for example,

for (int j=0; j<NJ; j++) //rows
{
  for (int i=0; i<NI; i++) //columns
   {
     Pre_2D_array[i+j*NI].x=1.0;
     Pre_2D_array[i+j*NI].y=2.0;
   }
}

I want to compute 1D FFT of the Pre_2D_array in row dimension. The only way I can think out is reshaping the array and the doing the FFT like this,

   for (int i=0; i<NI; i++) //columns
    {
      for (int j=0; j<NJ; j++) //rows
       {
         2D_array[j+i*NJ]=Pre_2D_array[i+j*NI];
       }
    }

DFTI_DESCRIPTOR_HANDLE desc_x = 0;
DftiCreateDescriptor(&desc_x, DFTI_PREC, DFTI_COMPLEX, 1, NJ);
DftiSetValue(desc_x, DFTI_NUMBER_OF_TRANSFORMS, NI);
DftiSetValue(desc_x, DFTI_INPUT_DISTANCE,  NJ);
DftiCommitDescriptor(desc_x);

DftiComputeForward(desc_x, 2D_array);

Althougth this can get the right answer. But doing the transposition(reshaping) of the original array wastes too much time when the array is big. Is there any way to do the FFT without reshaping the array? Or any fast way to reshape the array as fast as possible?

the cpuinfo is:

processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model       : 79
model name  : Intel(R) Xeon(R) CPU E5-2648L v4 @ 1.80GHz
stepping    : 1
microcode   : 0xb000022
cpu MHz     : 1795.882
cache size  : 35840 KB
physical id : 0
siblings    : 14
core id     : 0
cpu cores   : 14
apicid      : 0
initial apicid  : 0
fpu     : yes
fpu_exception   : yes
cpuid level : 20
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap
bogomips    : 3591.76
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

score 1 · Answer 1 · answered Mar 08 '19 at 14:20

As far as I can tell, the Intel MKL doesn't provide the capability to perform FFTs in data with a stride between data elements.

However, FFTW does. Per 4.4.1 Advanced Complex DFTs of the FFTW documentation:

fftw_plan fftw_plan_many_dft(int rank, const int *n, int howmany,
                             fftw_complex *in, const int *inembed,
                             int istride, int idist,
                             fftw_complex *out, const int *onembed,
                             int ostride, int odist,
                             int sign, unsigned flags);
This routine plans multiple multidimensional complex DFTs, and it extends the fftw_plan_dft routine (see Complex DFTs) to compute howmany transforms, each having rank rank and size n. In addition, the transform data need not be contiguous, but it may be laid out in memory with an arbitrary stride. To account for these possibilities, fftw_plan_many_dft adds the new parameters howmany, {i,o}nembed, {i,o}stride, and {i,o}dist. The FFTW basic interface (see Complex DFTs) provides routines specialized for ranks 1, 2, and 3, but the advanced interface handles only the general-rank case.

howmany is the (nonnegative) number of transforms to compute. The resulting plan computes howmany transforms, where the input of the k-th transform is at location in+k*idist (in C pointer arithmetic), and its output is at location out+k*odist. Plans obtained in this way can often be faster than calling FFTW multiple times for the individual transforms. The basic fftw_plan_dft interface corresponds to howmany=1 (in which case the dist parameters are ignored).

Each of the howmany transforms has rank rank and size n, as in the basic interface. In addition, the advanced interface allows the input and output arrays of each transform to be row-major subarrays of larger rank-rank arrays, described by inembed and onembed parameters, respectively. {i,o}nembed must be arrays of length rank, and n should be elementwise less than or equal to {i,o}nembed. Passing NULL for an nembed parameter is equivalent to passing n (i.e. same physical and logical dimensions, as in the basic interface.)

The stride parameters indicate that the j-th element of the input or output arrays is located at j*istride or j*ostride, respectively. (For a multi-dimensional array, j is the ordinary row-major index.) When combined with the k-th transform in a howmany loop, from above, this means that the (j,k)-th element is at j*stride+k*dist. (The basic fftw_plan_dft interface corresponds to a stride of 1.)

For in-place transforms, the input and output stride and dist parameters should be the same; otherwise, the planner may return NULL.

The page conveniently supplies a (somewhat confusing) example of performing 1D FFTs on the columns of a 2D array:

Transform each column of a 2d array with 10 rows and 3 columns:

   int rank = 1; /* not 2: we are computing 1d transforms */
   int n[] = {10}; /* 1d transforms of length 10 */
   int howmany = 3;
   int idist = odist = 1;
   int istride = ostride = 3; /* distance between two elements in 
                                 the same column */
   int *inembed = n, *onembed = n;

Also see How do I use fftw_plan_many_dft on a transposed array of data? for more examples.

Thank you for providing another way to solve it. But now I only want to use MKL to solve it. @francis showed that the Intel MKL also provide the function to perform FFTs in data with a stride between data elements. — Jie.Chen, Mar 10 '19 at 02:53

francis · Accepted Answer · 2019-03-09T17:26:07.407

1

The FFTW library has introduced arguments istride or ostride in functions like fftw_plan_many_dft() to avoid transposing the array. The last example on that page is a DFT on the second dimension.

Similarly, the Intel math kernel library introduces data layout configuration parameters such as DFTI_INPUT_STRIDES and DFTI_OUTPUT_STRIDES or DFTI_NUMBER_OF_TRANSFORMS.

The DFT over the second dimension may look like (I did not tested it):

DftiCreateDescriptor(&desc_x, DFTI_PREC, DFTI_COMPLEX, 1, NJ);
DftiSetValue(desc_x, DFTI_NUMBER_OF_TRANSFORMS, NI);
DftiSetValue(desc_x, DFTI_INPUT_STRIDES, &NI);
DftiSetValue(desc_x, DFTI_OUTPUT_STRIDES, &NI);
DftiSetValue(desc_x, DFTI_INPUT_DISTANCE,  1);
DftiSetValue(desc_x, DFTI_OUTPUT_DISTANCE,  1);
DftiCommitDescriptor(desc_x);

DFTI_OUTPUT_STRIDES is ignored for in-place transforms (DFTI_PLACEMENT=DFTI_INPLACE).

edited Mar 09 '19 at 17:26

answered Mar 08 '19 at 14:25

francis

9,525
2
25
41

It will return "Segment fault" if I set the desc_x according to your codes. And I cann't find the problem. – Jie.Chen Mar 09 '19 at 08:39
Looking at the documentation, I just rode that `DFTI_INPUT_STRIDES` is supposed to be an integer array. Could you try `DftiSetValue(desc_x, DFTI_INPUT_STRIDES, &NI);` and `DftiSetValue(desc_x, DFTI_OUTPUT_STRIDES, &NI);`? – francis Mar 09 '19 at 10:28
Oh，I forgot it！ Your answer really works for me! Thank you. @francis – Jie.Chen Mar 10 '19 at 02:49

score 0 · Answer 3 · answered Mar 07 '19 at 14:51

0

You cannot do 1D FFTs along higher dimensions of your data. You need to do transposition first so that the FT dimension is the one, where the data is contiguous in RAM.

However, it is not as bad as you might think. On a multi core machine, you can easily setup some threads, whose sole job is to pre / post arrange FT data.

answered Mar 07 '19 at 14:51

Kaveh Vahedipour

3,412
1
14
22

Thanks you for answering. Is there any functions to do the transposition instead of using the "for(int i=0;...)" to do the cycle. – Jie.Chen Mar 08 '19 at 02:45
How large are your matrices? – Kaveh Vahedipour Mar 08 '19 at 09:31
The size of my matrices is 4096*2048. I want to do the 1D FFT on 2048, but the matrices are stored by 4096 dimension first. – Jie.Chen Mar 08 '19 at 13:17
Awesome, and what does `cat /proc/cpuinfo` say? – Kaveh Vahedipour Mar 08 '19 at 13:17
The CPU is E5-2648L v4 with 14 cores. But I can only use one thread on one core to compute it. Because other cores have the same mission. – Jie.Chen Mar 08 '19 at 13:23
Gotcha, i needed the cpuinfo output to not have to check myself, what intrisics are available – Kaveh Vahedipour Mar 08 '19 at 13:24
I have edit my question by adding the cpuinfo. @Kaveh Vahedipour – Jie.Chen Mar 08 '19 at 13:54
Cool, supports AVX. Will get back to you. Here's the strategy. Right generic code to transpose the matrix with AVX intrinsics. Will write it for https://github.com/kvahed/codeare and you can directly use it. – Kaveh Vahedipour Mar 08 '19 at 13:56
Thank you very much. Looking forward to seeing your codes! – Jie.Chen Mar 08 '19 at 14:08

Is there any way to compute 1D FFT of 2D FFT in another dimension without transposition using intel mkl?

3 Answers3