In this lab we will look at how to optimize an application that computes many small LU decompositions, using Intel MKL for the computation. We will focus on how to distribute the workload between the OpenMP threads in our code and those working inside the MKL implementation.

0. "main.cc" contains a basic implementation in which the matrices are decomposed one at a time. In this case, threading is done internally by the library. Study the implementation, then compile and run the application on both the CPU and the MIC to establish the baseline performance.

1. Set the number of matrices to 1 and experiment with different matrix sizes on the CPU and on the MIC architecture. Can you come up with a general prescription for good matrix sizes, especially on Intel Xeon Phi coprocessors?

2. Now let us try nested parallelism, so that some of the thread parallelism comes from an OpenMP parallel region we add and the rest is done by MKL. First, add an OpenMP parallel for around the loop in which the LU decompositions take place. Then set the following environment variables (see the 2D FFT lab for more on hot teams).

OMP_NESTED=1                      - Enables nested parallelism.
OMP_MAX_ACTIVE_LEVELS=2           - Sets the number of nested parallelism levels.
MKL_DYNAMIC=false                 - Disables the dynamic thread count feature of MKL.
OMP_PROC_BIND=spread,close        - Determines thread binding to processors on multiple levels.
OMP_PLACES=threads                - Sets the OpenMP places (see the OpenMP 4.0 documentation for the definition).
OMP_NUM_THREADS={level1},{level2} - Sets the number of threads for the two levels.
KMP_HOT_TEAMS_MODE=1              - Enables hot teams.
KMP_HOT_TEAMS_MAX_LEVEL=2         - Sets the deepest nesting level at which hot teams stay active.
*For more information on these environment variables, refer to the Intel C++ compiler reference guide.

Now experiment with OMP_NUM_THREADS to see which combination of threads achieves the best performance. Be sure to keep the total number (level1 * level2) below the maximum number of threads. Don't be afraid to try some strange combinations (for example, try OMP_NUM_THREADS=20,12 on the MIC).
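A sweep over thread combinations could be scripted roughly as follows; the binary name `./app` and the particular level1/level2 values are illustrative placeholders, not the lab's actual names.

```shell
# One-time settings from the list above
export OMP_NESTED=1
export OMP_MAX_ACTIVE_LEVELS=2
export MKL_DYNAMIC=false
export OMP_PROC_BIND=spread,close
export OMP_PLACES=threads
export KMP_HOT_TEAMS_MODE=1
export KMP_HOT_TEAMS_MAX_LEVEL=2

# Sweep the two-level thread split; keep level1*level2 below the thread maximum
for level1 in 2 4 8 15 20 30; do
  for level2 in 2 4 8; do
    export OMP_NUM_THREADS=${level1},${level2}
    echo "=== OMP_NUM_THREADS=${OMP_NUM_THREADS} ==="
    ./app   # hypothetical binary name
  done
done
```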

Finally, add a schedule clause to the OpenMP parallel for and see its effect on the performance.
