In this lab we will look at how to optimize an application that computes many small 2D FFTs using Intel MKL. We will focus on how the workload is distributed to the OpenMP threads working within the MKL implementation.

0. "main.cc" contains a batch-mode implementation of the 2D FFT. In this case, threading is done internally by the library. Study the implementation, then compile and run the application on both the CPU and the MIC to get the baseline performance.
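
As a point of reference, a batch-mode 2D complex FFT with MKL's DFTI interface looks roughly like the sketch below. The transform size, batch count, and variable names are illustrative assumptions, not taken from "main.cc"; building it requires MKL (`-mkl` with the Intel compiler).

```cpp
// Sketch of a batched 2D complex-to-complex FFT with MKL DFTI.
// Assumes single precision, in-place transforms, and contiguous
// n x n arrays packed back to back; MKL threads internally here.
#include <mkl_dfti.h>
#include <vector>
#include <complex>

int main() {
  const MKL_LONG n = 64;        // transform size in each dimension (assumed)
  const MKL_LONG howMany = 128; // number of transforms in the batch (assumed)
  std::vector<std::complex<float>> data(n * n * howMany);

  DFTI_DESCRIPTOR_HANDLE handle;
  MKL_LONG sizes[2] = {n, n};
  DftiCreateDescriptor(&handle, DFTI_SINGLE, DFTI_COMPLEX, 2, sizes);
  // Batch mode: one call transforms the whole set of arrays.
  DftiSetValue(handle, DFTI_NUMBER_OF_TRANSFORMS, howMany);
  DftiSetValue(handle, DFTI_INPUT_DISTANCE, n * n); // stride between transforms
  DftiCommitDescriptor(handle);

  DftiComputeForward(handle, data.data()); // in-place forward FFT over the batch

  DftiFreeDescriptor(&handle);
  return 0;
}
```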

1. The original implementation relies on the MKL library for threading. Let us take it to the other extreme and implement all multi-threading with OpenMP parallel regions, keeping each FFT single-threaded. There are several methods for doing this; let us use the one where every thread gets its own handle. MKL FFT defaults to a single-threaded implementation when "DftiComputeForward()" is called from within a parallel region, so you do not have to set DFTI_THREAD_LIMIT.

Implement this model of thread parallelism in our 2D FFT application. Then compile and run the application to measure the performance. Hint: you will likely see a drop in performance with this implementation.
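
One possible sketch of the per-thread-handle approach is below; the sizes and names are again illustrative, and MKL is required to build it.

```cpp
// Sketch of step 1: every OpenMP thread owns a private DFTI descriptor
// and the batch is split across threads with an OpenMP loop.
#include <mkl_dfti.h>
#include <omp.h>
#include <vector>
#include <complex>

int main() {
  const MKL_LONG n = 64;   // transform size (assumed)
  const int howMany = 128; // number of independent 2D FFTs (assumed)
  std::vector<std::complex<float>> data((size_t)n * n * howMany);

  #pragma omp parallel
  {
    // Each thread creates and commits its own handle. Because
    // DftiComputeForward() is called inside a parallel region, MKL
    // runs each transform single-threaded; no DFTI_THREAD_LIMIT needed.
    DFTI_DESCRIPTOR_HANDLE handle;
    MKL_LONG sizes[2] = {n, n};
    DftiCreateDescriptor(&handle, DFTI_SINGLE, DFTI_COMPLEX, 2, sizes);
    DftiCommitDescriptor(handle);

    #pragma omp for
    for (int i = 0; i < howMany; i++)
      DftiComputeForward(handle, &data[(size_t)i * n * n]);

    DftiFreeDescriptor(&handle);
  }
  return 0;
}
```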

2. Now let us try the middle ground: we will use nested parallelism so that some of the thread parallelism is handled by the OpenMP parallel region we added, and some of it by MKL. If you did not add thread-count-specific code (like setting DFTI_THREAD_LIMIT) aside from the handle initialization, you should not have to modify the source code for this step. Instead, you will need to set the following environment variables:

OMP_NESTED=1                      - Enables nested parallelism.
OMP_MAX_ACTIVE_LEVELS=2           - Sets the number of nested parallel levels.
MKL_DYNAMIC=false                 - Disables MKL's dynamic adjustment of the thread count.
OMP_PROC_BIND=spread,close        - Determines thread binding to processors at each level.
OMP_PLACES=threads                - Sets the OpenMP places (see the OpenMP 4.0 specification for the definition).
OMP_NUM_THREADS={level1},{level2} - Sets the number of threads for the two levels.
*For more information on these environment variables, refer to the Intel C++ Compiler Reference Guide and the MKL Reference Manual (for MKL_DYNAMIC).
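
Put together, a run might be launched as below. The thread counts (60,4) and the binary name "./fft2d" are placeholders to adapt to your build and hardware.

```shell
# Environment for nested parallelism: the first level is our OpenMP
# parallel region, the second level is MKL's internal threading.
export OMP_NESTED=1
export OMP_MAX_ACTIVE_LEVELS=2
export MKL_DYNAMIC=false
export OMP_PROC_BIND=spread,close
export OMP_PLACES=threads
export OMP_NUM_THREADS=60,4   # placeholder counts: tune for your hardware
# ./fft2d                     # placeholder: launch your compiled binary here
```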

At this point, you should see the performance plummet. This is due to two factors: MKL also uses OpenMP internally for multi-threading, and repeatedly creating and destroying OpenMP parallel regions can be costly. To combat this, the Intel OpenMP runtime has a feature called hot teams that keeps threads "hot", so that the transition from one parallel region to the next is smoother. To enable this feature for nested parallelism, use:

KMP_HOT_TEAMS_MODE=1              - Enables the hot teams feature.
KMP_HOT_TEAMS_MAX_LEVEL=2         - Sets the deepest nesting level at which hot teams remain active.
*For more information on these environment variables, refer to the Intel C++ Compiler Reference Guide.
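
In shell form, these two settings are added on top of the variables from the previous step:

```shell
# Enable hot teams (an Intel OpenMP runtime extension) for both
# nesting levels, so worker threads persist between parallel regions.
export KMP_HOT_TEAMS_MODE=1
export KMP_HOT_TEAMS_MAX_LEVEL=2
```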

Now experiment with OMP_NUM_THREADS to see which combination of threads achieves the best performance. Be sure to keep the total number of threads (level1 * level2) below the maximum number of hardware threads. Don't be afraid to try some unusual combinations (for example, try OMP_NUM_THREADS=20,12 on the MIC).
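
A simple sweep script can automate the experiment; the thread-count lists and the binary name "./fft2d" are hypothetical placeholders.

```shell
# Sweep over level1,level2 combinations for OMP_NUM_THREADS.
# Keep level1 * level2 within the machine's hardware thread count.
for L1 in 5 10 20 30; do
  for L2 in 2 4 8 12; do
    export OMP_NUM_THREADS="${L1},${L2}"
    echo "Testing OMP_NUM_THREADS=${OMP_NUM_THREADS}"
    # ./fft2d    # placeholder: uncomment to benchmark this combination
  done
done
```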
