In this lab we will look at how to optimize an application that computes many small 1D FFTs, using Intel MKL for the computation.

0. "main.cc" contains a basic implementation that computes the FFTs sequentially, one after another, using all threads for each FFT. Study the implementation, then compile and run the application on both the CPU and the MIC to get the baseline performance. (You can simply type "make run-cpu" and "make run-mic".)

1. Intel MKL supports a feature called "batch mode" that can compute multiple FFTs in a single call. When each FFT is small, as in this case, it is often beneficial to use this mode. Implement the batch mode computation, then compile and run the application to see the performance difference. You may have to modify the Makefile to increase the number of FFTs.

Hint: Use the "DftiSetValue(desc_handle, config_param, config_val)" function to set three values: DFTI_INPUT_DISTANCE, DFTI_OUTPUT_DISTANCE, and DFTI_NUMBER_OF_TRANSFORMS. Also remember to remove the for loop that iterated over every FFT: there should be only one call to DftiComputeForward() per trial iteration.
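
To make the three values in the hint concrete, here is a minimal sketch of the memory layout they describe. It assumes the FFTs are packed back to back in one buffer, which may not match "main.cc"; the names BatchLayout, contiguous_layout, element_index and buffer_elements are hypothetical helpers, not MKL or lab code. The distances are counted in elements, not bytes.

```cpp
#include <cstddef>

// Hypothetical helper (not part of main.cc or MKL): describes a batch of
// 1D FFTs that share one buffer.
struct BatchLayout {
    std::size_t fft_size;  // points per transform
    std::size_t num_ffts;  // value for DFTI_NUMBER_OF_TRANSFORMS
    std::size_t distance;  // value for DFTI_INPUT_DISTANCE / DFTI_OUTPUT_DISTANCE,
                           // in elements between the starts of consecutive FFTs
};

// Assumed packing: transform k starts right where transform k-1 ends,
// so the distance equals the transform length.
inline BatchLayout contiguous_layout(std::size_t fft_size, std::size_t num_ffts) {
    return BatchLayout{fft_size, num_ffts, fft_size};
}

// Index of element j of transform k within the shared buffer.
inline std::size_t element_index(const BatchLayout& l, std::size_t k, std::size_t j) {
    return k * l.distance + j;
}

// Number of elements the buffer must hold for the whole batch.
inline std::size_t buffer_elements(const BatchLayout& l) {
    return l.num_ffts == 0 ? 0 : (l.num_ffts - 1) * l.distance + l.fft_size;
}
```

With this layout, 4 transforms of 8 points each need a 32-element buffer, and element 3 of transform 2 sits at index 19. In MKL the corresponding DftiSetValue() calls have to come before DftiCommitDescriptor() for the settings to take effect.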

2. Try different values for the environment variables KMP_AFFINITY and OMP_NUM_THREADS inside the Makefile to see how they affect performance on both the CPU and the MIC.

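3. Because FFT is an O(n log(n)) workload, it tends to be memory-bandwidth bound. On NUMA architectures it can therefore suffer from first-touch allocation issues: a memory page is typically placed on the NUMA node of the thread that first writes to it, so data initialized by a single thread ends up on a single node. Modify the function "init_data()" so that it is multi-threaded, then compile and run the application on both the CPU and the MIC and compare the performance.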
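
One possible shape for a multi-threaded initialization is sketched below with OpenMP. The function name init_data_parallel, its signature, the single-precision complex type and the fill values are all assumptions made for illustration; the real "init_data()" in "main.cc" may look different.

```cpp
#include <complex>
#include <cstddef>

// Hypothetical stand-in for the lab's init_data(); the signature and the
// element type are assumptions.
void init_data_parallel(std::complex<float>* data,
                        std::size_t fft_size,
                        std::size_t num_ffts) {
    // schedule(static) hands each thread a fixed, contiguous block of FFTs,
    // so the pages of that block are first written by that thread.
    #pragma omp parallel for schedule(static)
    for (std::ptrdiff_t k = 0; k < static_cast<std::ptrdiff_t>(num_ffts); ++k) {
        std::complex<float>* x = data + static_cast<std::size_t>(k) * fft_size;
        for (std::size_t j = 0; j < fft_size; ++j) {
            // Arbitrary fill pattern; the values themselves do not matter here.
            x[j] = std::complex<float>(static_cast<float>(j % 7), 0.0f);
        }
    }
}
```

First touch only helps if nothing has written to the buffer beforehand, so an allocation that zero-fills the memory from one thread (for example a value-initialized std::vector) would defeat it. Whether the threads that later compute the FFTs end up near the data they initialized also depends on the affinity settings from step 2.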

4. In some workloads, alignment can have a significant effect on performance. Modify the memory allocation for the FFT data so that it is aligned to a 64-byte boundary (use _mm_malloc or mkl_malloc). Compile and run on both the CPU and the MIC and compare the performance.
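
The sketch below shows what a 64-byte aligned allocation looks like and how to check the result. It uses std::aligned_alloc (C++17) as a portable stand-in so that it builds without MKL or the Intel intrinsics headers; in the lab, _mm_malloc(size, 64) or mkl_malloc(size, 64) would take its place. The names alloc_aligned and is_aligned are hypothetical helpers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical helper: returns storage for at least `bytes` bytes whose
// address is a multiple of `alignment`. Uses std::aligned_alloc as a
// stand-in for _mm_malloc / mkl_malloc.
inline void* alloc_aligned(std::size_t bytes, std::size_t alignment = 64) {
    // The alignment must be a nonzero power of two.
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    // std::aligned_alloc expects the size to be a multiple of the alignment,
    // so round the request up.
    std::size_t padded = (bytes + alignment - 1) / alignment * alignment;
    return std::aligned_alloc(alignment, padded);
}

// True if the address of p is a multiple of `alignment`.
inline bool is_aligned(const void* p, std::size_t alignment = 64) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

Memory from alloc_aligned is released with std::free. If you switch to the allocators named in the step, keep the pairs matched: _mm_malloc with _mm_free, and mkl_malloc with mkl_free.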
