HOW Series “Deep Dive”: Webinars on Performance Optimization – 2017 Edition


In a Nutshell

HOW Series “Deep Dive” is a free Web-based training on parallel programming and performance optimization on Intel architecture. The workshop includes 20 hours of instruction and up to 2 weeks of remote access to dedicated training servers for hands-on exercises. This training is free to everyone thanks to Intel’s sponsorship.


You can get trained in one of two ways:

Self-paced: Start Right Now

You can access the video recordings of lectures, slides of presentations and code of practical exercises on this page using a free Colfax Research account.

This option is free and open to everyone. However, self-paced study does not give you the benefits of joining a workshop (which is also free, but tied to specific dates).


Join a Workshop: Interact and Practice

You can join one of the 2-week long HOW series workshops to get these additional benefits:

  • Access a dedicated computing cluster to run exercises
  • Get reminders to join daily online broadcasts of recordings
  • Live chat with instructor during broadcasts
  • Receive a certificate of completion


Upcoming Workshops:

(the 10-day workshops occurring in different months have the same agenda)

October 2017
    — 16:00 GMT (for America/Europe)
November 2017
    — 04:00 GMT (for Asia/Pacific) + 16:00 GMT on same day

All times and dates are in the GMT time zone. When converting the GMT session times (04:00 and 16:00) to your local time, check the following:

  1. Is the time zone detected from your browser the correct time zone for you?
  2. Does your locale change to/from daylight saving time before or during the workshop?
  3. Does the time difference put you into the next or previous day?




Why Attend the HOW Series

Here is what the HOW Series training will deliver for you:

Learn Modern Code

Are you realizing the payoff of parallel processing? Are you aware that without code optimization, computational applications may perform orders of magnitude worse than they are supposed to?

The Web-based HOW Series training provides the knowledge needed to extract more of the parallel performance potential of Intel® Xeon® and Intel® Xeon Phi™ processors and coprocessors. Course materials and practical exercises are accessible to developers beginning their journey into parallel programming, yet detailed enough to also cater to high-performance computing experts.

Jump to detailed descriptions of sessions to see what you will learn.

Practice New Skills

The HOW Series is an experiential learning program: you watch code optimization performed live, and you practice it with your own hands. The workshop consists of instructional and hands-on self-study components:

The instructional part of each workshop consists of 10 lecture sessions. Each session presents 1 hour of theory and 1 hour of practical demonstrations. Lectures can be viewed live by joining the broadcast, or offline as streaming video.

In the self-study part, for the duration of the workshop attendees are provided with remote access over SSH to a Linux-based cluster of training servers. These machines feature Intel Xeon Phi processors (KNL) and Intel software development tools. Students can use these training servers to run exercises provided in the course or experiment with their own applications.

Receive Certificate


Attendees of these workshops may receive a certificate of completion. The certificate states the Fundamental level of accomplishment in the Parallel Programming Track.

Attending at least 6 out of 10 live broadcast sessions is required to receive the certificate. Certificates are delivered via email within 7 days after the last webinar.

What HOW Series Graduates are Saying:

"An un-parallel resource for learning trade skills of using Intel Xeon Phi platform for Parallel Computing. What makes this program all the more engaging is its focus on the 'doing' aspect as there is no substitute to 'learning by doing'."

Gaurav Verma
HOW attendee

"I am a physicist working with Monte Carlo models of particle transport on matter. My main working language is Fortran and I was able to follow the course without problems. The course gave me all the tools needed to start working with the Intel Xeon Phi MICs."

Edgardo Doerner
HOW attendee

"This is a very well thought out course that gradually introduces the important concepts. The presenter and his team are experts with a lot of experience as evident from the lectures. The live demos for optimizations are very educative and well executed."

Vikram K. Narayana
HOW attendee

Course Roadmap

Instructor Bio

Andrey Vladimirov, Ph.D., is Head of HPC Research at Colfax International. His primary interest is the application of modern computing technologies to computationally demanding scientific problems. Prior to joining Colfax, A. Vladimirov was involved in computational astrophysics research at Stanford University, North Carolina State University, and the Ioffe Institute (Russia), where he studied cosmic rays, collisionless plasmas and the interstellar medium using computer simulations. He is the lead author of the book "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors", a regular contributor to the online resource Colfax Research, author of invited papers in "High Performance Parallelism Pearls" and "Intel Xeon Phi Processor High Performance Programming: Knights Landing Edition", and an author or co-author of over 10 peer-reviewed publications in the fields of theoretical astrophysics and scientific computing.


We assume that you know the fundamentals of programming in C or C++ in Linux. If you are not familiar with Linux, read a basic tutorial such as this one. We assume that you know how to use a text-based terminal to manipulate files, edit code and compile applications.

Remote Access for Hands-On Exercises

All registrants will receive remote access to a cluster of training servers. The compute nodes in the cluster are based on Intel Xeon Phi x200 family processors (formerly KNL), and additional nodes with Intel Xeon processors and Intel Xeon Phi coprocessors (KNC) are available. Intel software development tools are provided in the cluster under the Evaluation license.

Remote access is granted a few days before the start of the series. It is active for the entire duration of the training, including nights and weekends. However, technical support is available only during normal business hours, Pacific time.

You do not have to use remote access to the Colfax Cluster to participate in the hands-on part. During live webinars, the instructor will demonstrate the discussed programming exercises. You can either follow along or use your own computing system (Intel compilers required).

Slides, Code and Video:


Practical exercises in this training were originally published as supplementary code for "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors". Since then, we have re-released them under the MIT license and posted them on GitHub with the latest updates.

You can download the labs as a ZIP archive or clone from GitHub:

git clone


Session 01
Intel Architecture and Modern Code

We introduce Intel Xeon and Intel Xeon Phi processors and discuss their features and purpose. We also begin an introduction to portable, future-proof parallel programming: thread parallelism, vectorization, and optimized memory access patterns. The hands-on part introduces Intel compilers and additional software for the efficient solution of computational problems, and illustrates the usage of the Colfax Cluster for the programming exercises presented in the course.

Slides:  Colfax_HOW_Series_01.pdf (13 MB) — this file is available only to registered users. Register or Log In.

Session 02
Xeon Phi, Coprocessors, Omni-Path

We talk about high-bandwidth memory (MCDRAM) in Intel Xeon Phi processors and demonstrate programming techniques for using it. We also discuss the coprocessor form-factor of the Intel Xeon Phi platform, learning to use the native and the explicit offload programming models. The session introduces the high-bandwidth interconnects based on the Intel Omni-Path Architecture, and discusses its application in heterogeneous programming with offload over fabric.

Slides:  Colfax_HOW_Series_02.pdf (4 MB) — this file is available only to registered users. Register or Log In.

Session 03
Expressing Parallelism with Vectors

This session introduces data parallelism and automatic vectorization. Topics include: the concept of SIMD operations; the history and future of vector instructions in Intel Architecture, including AVX-512; using intrinsics to vectorize code; and automatic vectorization with Intel compilers for loops, expressions with array notation, and SIMD-enabled functions. The hands-on part focuses on using the Intel compiler to perform automatic vectorization, diagnosing its success, and making automatic vectorization happen when the compiler does not see opportunities for data parallelism.

Slides:  Colfax_HOW_Series_03.pdf (3 MB) — this file is available only to registered users. Register or Log In.

Session 04
Multi-threading with OpenMP

A crash course on thread parallelism and the OpenMP framework. We will talk about using threads to utilize multiple processor cores, coordination of thread and data parallelism, and using OpenMP to create threads and team them up to process loops and trees of tasks. You will learn to control the number of threads, variable sharing and loop scheduling modes, to use mutexes to avoid race conditions, and to implement parallel reduction. The hands-on part demonstrates using OpenMP to parallelize a serial computation and shows for-loops, variable sharing, mutexes and parallel reduction on an example application performing numerical integration.

Slides:  Colfax_HOW_Series_04.pdf (3 MB) — this file is available only to registered users. Register or Log In.

Session 05
Distributed Computing, MPI

The last session of the week is an introduction to distributed computing with the Message Passing Interface (MPI) framework. We illustrate the usage of MPI on standard processors as well as Intel Xeon Phi processors and heterogeneous systems with coprocessors. Inter-operation between Intel MPI and OpenMP is illustrated.

Slides:  Colfax_HOW_Series_05.pdf (6 MB) — this file is available only to registered users. Register or Log In.

Session 06
Optimization Overview: N-body

Here we begin the discussion of performance optimization. This episode lays out an optimization roadmap that classifies optimization techniques into five categories. The lecture part demonstrates the application of some of the techniques from these five categories to an example application implementing the direct N-body simulation. The hands-on part of the episode demonstrates the optimization process for the N-body simulation and measures the performance gains obtained on an Intel Xeon E5 processor and their scalability to an Intel Xeon Phi coprocessor (KNC) and processor (KNL).

Slides:  Colfax_HOW_Series_06.pdf (3 MB) — this file is available only to registered users. Register or Log In.

Session 07
Scalar Tuning, Vectorization

We start going into the details of performance tuning. The session will cover essential compiler arguments, control of precision and accuracy, unit-stride memory access, data alignment, padding and compiler hints, and the usage of strip-mining for vectorization. The hands-on part will illustrate these techniques in the implementation of LU decomposition of small matrices. We will begin demonstrating strip-mining and AVX-512CD in the computation of binning.

Slides:  Colfax_HOW_Series_07.pdf (2 MB) — this file is available only to registered users. Register or Log In.

Session 08
Common Multi-threading Problems

This session talks about common problems in the optimization of multi-threaded applications. We revisit OpenMP and the binning example from Session 7 to implement multi-threading in that code. The process takes us to a discussion of race conditions, mutexes, and efficient parallel reduction with thread-private variables. We also encounter false sharing and demonstrate how it can be eliminated. The second example discussed in this episode represents stencil operations, and the discussion of this example deals with the problem of insufficient parallelism and demonstrates how to move parallelism from vectors to cores using strip-mining and loop collapse.

Slides:  Colfax_HOW_Series_08.pdf (3 MB) — this file is available only to registered users. Register or Log In.

Session 09
Multi-Threading, Memory Aspect

We continue talking about optimization of multi-threading, this time from the memory access point of view. The material includes principles and methods of affinity control and best practices for NUMA architecture, which includes two-way and four-way Xeon processors and Intel Xeon Phi processors in SNC-2/SNC-4 clustering modes. The hands-on part of the episode demonstrates the tuning of matrix multiplication and array copying on a standard processor as well as an Intel Xeon Phi processor and coprocessor.

Slides:  Colfax_HOW_Series_09.pdf (4 MB) — this file is available only to registered users. Register or Log In.

Session 10
Access to Caches and Memory

Memory traffic optimization: how to optimize the utilization of caches, and how to optimally access the main memory. We discuss the requirement of data access locality in space and time and demonstrate techniques for achieving it: unit-stride access, loop fusion, loop tiling and cache-oblivious recursion. We also talk about optimizing memory bandwidth, focusing on the MCDRAM in Intel Xeon Phi processors. The hands-on part demonstrates the application of the discussed methods to the matrix-vector multiplication code.

Slides:  Colfax_HOW_Series_10.pdf (4 MB) — this file is available only to registered users. Register or Log In.

System Requirements (IMPORTANT!)

Get Your System Ready


To join the webinars:

  • Windows Vista – Windows 10 / Mac OS X 10.6 (Snow Leopard) or later / Google Chrome OS
  • Any modern browser for Windows and Mac OS X; only the Google Chrome browser for Linux (Firefox will not work!)
  • 1 Mbps or better Internet connection
  • 2 GB or more RAM
  • Speakers or a headset
  • Recommended: two displays - one for slides, the other for chat and remote access

For remote access:

  • Any recent Linux, Mac OS X, or Windows (additional software required for Windows)
  • 200 Kbps or better Internet connection
  • Allowed outgoing SSH traffic (port 22)

Supplementary Materials


The course is based on Colfax's book "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors", second edition. All topics discussed in the training are covered in the book. The practical exercises used in the course come with both the paper and the electronic edition. The book is an optional guide for the HOW Series training. Members of Colfax Research can get a discount.

Getting Your Questions Answered

At any time, feel free to post your questions at the Colfax Forums. We even have a special forum for HOW series-related discussion.

During the broadcast training sessions, you can ask questions in the chat window shown below. Instructors will respond either immediately, or during a question/answer session.

Chat Logs

  • September 18, 2017 8:20 am - September 29, 2017 9:51 am

    Andrey: Welcome to the HOW Series!

    dwilliams.lima: Is there any chance for (in the future) Flat, Cache and Hybrid be configured by software!?

    Andrey: @dwilliams.lima Currently, some platforms (Intel's own server board) support a tool called syscfg. It allows you to change BIOS settings from the OS. After that, reboot is still required.

    Andrey: Without reboot - probably not. From the OS perspective, changing MCDRAM mode looks like attaching or detaching RAM, and I don't think the Linux kernel supports hot-swappable RAM.

    dwilliams.lima: So, it can not work on a per application basis. But it would be a great idea, I think 🙂

    Andrey: In clusters with queue-based job management, you can configure job prologue to reboot nodes into the mode requested by the job. This is transparent to the user, but comes with a delay of a few minutes before the start of a job.

    dwilliams.lima: Thanks Andrey. See you all tomorrow!

    Andrey: Bye everyone!

    patflynn: When are we going to get info about logging into the cluster?

    nextdesign: @payflynn You should have received an email a few days ago. It was also covered at the end of last session.

    nextdesign: @patflynn

    Andrey: @patflynn Just resent your invitation in case you missed it.

    patflynn: Got it. Thank you.

    dwilliams.lima: There is no need to worry about cross compiling if using numactl!!!?

    Andrey: @dwilliams.lima You need to compile with -xMIC-AVX512 if you are going to use a Knights Landing platform. Whether you use numactl or not does not change the compiler arguments.

    nextdesign: Hi Andrey. Vector instructions use the same cache as regular data, correct? Say you are trying to calculate the center of a 3d triangle, so three values per vertex. Would it be better to lay out the data as x1 x2 x3 ... xn, y1 y2 y3 ... yn, z1 z2 z3 ... zn, or interleave the data, x1 y1 z1, x2 y2 z2, ..., xn yn zn. I would believe that the interleaving would be more efficient, as it would require less cache-misses, as you wouldn't have to jump around in the structure, from x to y to z. It would be retained in a single cache line.

    nextdesign: On the other hand, you could also calculate the average of the x components, then the y, then the z, but that would mean more data to keep around in memory.

    Andrey: @nextdesign We will talk about this exact question in Session 6. You can peek into the slides to see this discussion. Usually it is better for performance to have each quantity in which you have data parallelism to be laid out contiguously. I.e., x1 x2 x3 ... , y1 y2 y3, ... is better than x1 y1 z1, x2 y2 z2 ...

    Andrey: This allows you to quickly load data from memory and to completely fill one of your vectors with x1 x2 x3... x16, another with y1 y2... y16, etc.

    goyal: If I use mkl library in my code then we don't need to use explicit -xCORE-AVX2/-xCORE-AVX512 flag in my application code while compiling time. It automatically include these flags according underlying architecture Is it correct? or should I put explicit these flags for using vectorization?

    yshoraka: Is it more efficient to use multiple cores of a xeon-phi processor/co-processor using OpenMP or MPI?
    Would there be any advantages in writing a more complicated code that uses both MPI and OpenMP or just by using MPI, one can achieve the desired performance?

    goyal: Can you please arrange some webinar on mpitune? Since in large code it is very difficult to where should I do optimization? Mpitune gives optimize tune settings for optimization..

    yshoraka: What would happen if you chose 4 nodes in PBS file but asked for 8 nodes in mpirun?
    Then how will the work be distributed inside the nodes?

    yshoraka: What I meant is that if you want to divide available cores inside each node between two MPI jobs, how would you define it?

    Andrey: @goyal If you use MKL in your code, then MKL functions will automatically detect the runtime architecture. However, for your own loops and functions you need to use the corresponding -x flag.

    Andrey: @yshoraka Yes, it is generally best to use OpenMP for single-node calculations and to combine OpenMP with MPI for distributed calculations. Pure MPI has two drawbacks compared to the hybrid approach: greater memory footprint (you have a copy of the code and a copy of the read-only data for each MPI process) and greater communication overhead (some communication patterns scale as O(n^a) with the number of processes n).

    Andrey: @goyal mpitune allows you to find the optimum values of parameters that control the algorithms of collective communication routines in MPI. Thanks for the suggestion of a webinar, I will add it to the discussions.

    Andrey: @yshoraka If you want to have more than one MPI process per compute node, it is usually best to modify the machine file produced by PBS and add the number of nodes after a colon. See Section 7.5 in the access portal:

    Andrey: @goyal When you asked about mpitune, I think you may have been thinking of a tool that would provide optimization pointers. The MPI performance snapshot might be what you need. It gives you advice on what you need to focus on first: communication, OpenMP-parallelized code, or the remaining serial calculations.

    yshoraka: What happened if you double precision instead of float in this code?

    Andrey: You would need to make the corresponding replacement of sqrtf with sqrt, 1.0f with 1.0, and you may need a different value of the tiling constant (8 instead of 16). In the end, I got around 1.1 TFLOP/s in double precision on KNL.

    goyal: If I use -xCORE-AVX512 and -fp model precise(which disable vectorization) both in my code. My question is it makes sense to use both flags together. One is for vectorization and other for disable vectorization or should I use one flag at a time?

    goyal: Like fortran compiler (-align array64byte) is there any flags in C and C++ compiler for alignment

    goyal: I am talking for skylake xeone processor

    Andrey: @goyal Just checked: -fp-model precise does not disable vectorization, and the combination of -xCORE-AVX512 and -fp-model precise produces AVX-512 vectorized code.

    nextdesign: @Andrey There is a great series of tutorials on OpenMP by Tim Mattson here: It might be a useful link to some.

    Andrey: @nextdesign Thanks, we have a link to this video course on the Intel site. It is in Session 4 on slide 40 ("Intel's OpenMP Video Course"). However, on last check, it turned out that the link is no longer valid. I will replace it in the next revision of the slides.

    nextdesign: @Andrey Yes, I noticed the Intel site was down as well. I had a question regarding slide 23 in section 7. Could you not extract the redundant part of the loop into a function, and then force inlining on it, therefore adhering to the DRY principle? I know some compilers try their best to ignore inlining directives however.

    nextdesign: You could also go as far as writing the code in the preprocessor, and then having it replace it at compile time, but that's pretty much universally frowned upon.

    nextdesign: Sorry for the number of questions, but in exercise 4.02, when I #pragma vector aligned the main loop in CalculateElectricPotential, after replacing malloc with _mm_malloc, I saw a drop in performance of around 200 GFLOPS on KNL and an increase of 200 GFLOPS on KNC. Do you have any idea why this might be the case?

    nextdesign: I am aligning to 64-byte boundaries.

    Andrey: @nextdesign If you can inline, you should. However, there are cases when you cannot. For example, math libraries, particularly if you only distribute binaries (e.g., transcendental functions from the Intel Math Library are SIMD-enabled).

    Andrey: @nextdesign I tried 4.02 with #pragma vector aligned + _mm_malloc, but I did not see a performance drop on KNL.

    nextdesign: @Andrey This is the code that I'm using. I just tested it again on the cluster, and I'm able to reproduce it.

    goyal: Like fortran compiler (-align array64byte) is there any flags in C and C++ compiler for alignment for Xeone Skylake Processor

    goyal: may should I use for fortran compiler -align array32bytes?

    Andrey: @nextdesign Thanks, with your code, I did reproduce the behavior that you see, and I will respond later.

    Andrey: @goyal On last check, I have not found any flags for Intel C/C++ compilers similar to -align array[n]byte for the Intel Fortran compiler. For Skylake, use -align array64bytes because AVX-512 prefers 64-byte alignment.

    john1103: Hi Andrey. Spent the last few days at IXPUG.

    john1103: Knight's Mill adds deep learning instructions. It is not an updated/replacement for KNL. The supercomputer guys really like KNL. gcc generated better code than icpc for one presenter - that got Intel's attention.

    Andrey: @john1103 Thanks, John!

    nextdesign: @Andrey Did you ever manage to determine what the issue was with the alignment slowdowns?

  • August 14, 2017 8:10 am - August 24, 2017 12:04 am
  • July 21, 2017 8:00 am - July 25, 2017 9:05 am