HOW Series “Deep Dive”: Webinars on Performance Optimization, March 2017


In a Nutshell

HOW Series “Deep Dive” is a free 20-hour hands-on in-depth training on parallel programming and performance optimization in computational applications on Intel architecture. The 3rd run in 2017 begins March 13, 2017. Broadcasts start at 16:00 UTC (9:00 am in San Francisco, 12:00 noon in New York, 4:00 pm in London, 7:00 pm in Moscow, 9:30 pm in New Delhi, 1:00 am in Tokyo).

March 2017
    — Webinar+remote access
UTC 16:00
San Francisco 9:00 am
New York 12:00 noon
London 4:00 pm
Moscow 7:00 pm
New Delhi 9:30 pm
Tokyo 1:00 am





Why Attend the HOW Series

Colfax offers a free hands-on workshop, “Parallel Programming and Optimization for Intel® Architecture”, also known as HOW Series “Deep Dive”. Each workshop includes 20 hours of Web-based instruction and up to 2 weeks of remote access to dedicated training servers for hands-on exercises. These trainings are free to everyone thanks to Intel’s sponsorship. Here is what the HOW Series training will deliver for you:

Learn Modern Code

Are you realizing the payoff of parallel processing? Are you aware that without code optimization, computational applications may perform orders of magnitude worse than they are supposed to?

The Web-based HOW Series training provides extensive knowledge needed to extract more of the parallel compute performance potential found in both Intel® Xeon® and Intel® Xeon Phi™ processors and coprocessors. Course materials and practical exercises are appropriate for developers beginning their journey to parallel programming, with enough detail to also cater to high-performance computing experts.

Jump to detailed descriptions of sessions to see what you will learn.

Practice New Skills

The HOW series is an experiential learning program because you get to see code optimization performed live and also get to practice it with your own hands. The workshop consists of instructional and hands-on self-study components:

The instructional part of each workshop consists of 10 lecture sessions. Each session presents 1 hour of theory and 1 hour of practical demonstrations. Lectures can be viewed live by joining the live broadcast, or offline as streaming video.

In the self-study part, for the duration of the workshop attendees are provided with remote access over SSH to a Linux-based cluster of training servers. These machines feature Intel Xeon Phi processors (KNL) and Intel software development tools. Students can use these training servers to run exercises provided in the course or experiment with their own applications.

Receive Certificate

Attendees of these workshops may receive a certificate of completion. The certificate states the Fundamental level of accomplishment in the Parallel Programming Track.


Attending at least 6 out of 10 live broadcast sessions is required to receive the certificate. Certificates are delivered via email within 7 days after the last webinar.

Course Roadmap

Instructor Bio

Andrey Vladimirov, Ph.D., is Head of HPC Research at Colfax International. His primary interest is the application of modern computing technologies to computationally demanding scientific problems. Prior to joining Colfax, A. Vladimirov was involved in computational astrophysics research at Stanford University, North Carolina State University, and the Ioffe Institute (Russia), where he studied cosmic rays, collisionless plasmas and the interstellar medium using computer simulations. He is the lead author of the book “Parallel Programming and Optimization with Intel Xeon Phi Coprocessors”, a regular contributor to the online resource Colfax Research, author of invited papers in “High Performance Parallelism Pearls” and “Intel Xeon Phi Processor High Performance Programming: Knights Landing Edition”, and an author or co-author of over 10 peer-reviewed publications in the fields of theoretical astrophysics and scientific computing.


We assume that you know the fundamentals of programming in C or C++ in Linux. If you are not familiar with Linux, read a basic tutorial such as this one. We assume that you know how to use a text-based terminal to manipulate files, edit code and compile applications.

Remote Access for Hands-On Exercises

All registrants will receive remote access to a cluster of training servers. The compute nodes in the cluster are based on Intel Xeon Phi x200 family processors (formerly KNL), and additional nodes with Intel Xeon processors and Intel Xeon Phi coprocessors (KNC) are available. Intel software development tools are provided on the cluster under the Evaluation license.

Remote access is granted a few days before the start of the series. It is active for the entire duration of the training, including nights and weekends. However, technical support is available only during normal business hours, Pacific time.

You do not have to use remote access to the Colfax Cluster to participate in the hands-on part. During live webinars, the instructor will demonstrate the discussed programming exercises. You can either follow along, or use your own computing system (Intel compilers required).

Slides, Code and Video:


Practical exercises in this training were originally published as supplementary code for “Parallel Programming and Optimization with Intel Xeon Phi Coprocessors”. Since then, we have re-released them under the MIT license and posted them on GitHub with the latest updates.

You can download the labs as a ZIP archive or clone from GitHub:

git clone


Session 01 (Mar 13, 2017)
Intel Architecture and Modern Code

Recording will be posted here after the webinar

We introduce Intel Xeon and Intel Xeon Phi processors and discuss their features and purpose. We also begin the introduction to portable, future-proof parallel programming: thread parallelism, vectorization, and optimized memory access patterns. The hands-on part introduces Intel compilers and additional software for efficient solution of computational problems. It also illustrates the usage of the Colfax Cluster for programming exercises presented in the course.

Slides: coming soon…

Session 02 (Mar 14, 2017)
Xeon Phi, Coprocessors, Omni-Path

Recording will be posted here after the webinar

We talk about high-bandwidth memory (MCDRAM) in Intel Xeon Phi processors and demonstrate programming techniques for using it. We also discuss the coprocessor form factor of the Intel Xeon Phi platform, demonstrating the native and explicit offload programming models. The session introduces the high-bandwidth interconnects based on the Intel Omni-Path Architecture and discusses their application in heterogeneous programming with offload over fabric.

Slides: coming soon…

Session 03 (Mar 15, 2017)
Expressing Parallelism with Vectors

Recording will be posted here after the webinar

This session introduces data parallelism and automatic vectorization. Topics include the concept of SIMD operations; the history and future of vector instructions in Intel Architecture, including AVX-512; using intrinsics to vectorize code; and automatic vectorization with Intel compilers for loops, expressions with array notation, and SIMD-enabled functions. The hands-on part focuses on using the Intel compiler to perform automatic vectorization, diagnosing its success, and making automatic vectorization happen when the compiler does not see opportunities for data parallelism.
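
As a rough illustration of the kind of loop that automatic vectorization targets, here is a minimal sketch in C. It is not taken from the course labs; the function name saxpy and its signature are our own assumptions. The loop uses unit-stride access and restrict-qualified pointers so that the compiler can rule out dependencies between iterations. The "#pragma omp simd" directive is a standard OpenMP 4.0 hint, and compilers that are not run with OpenMP support simply ignore it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: y[i] = a * x[i] + y[i].
   restrict promises that x and y do not overlap, which removes the
   aliasing concern that often blocks automatic vectorization. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    assert(x != NULL && y != NULL);

    /* Hint that iterations are independent; ignored without OpenMP. */
    #pragma omp simd
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

Whether a given loop was actually vectorized should be checked rather than assumed; the Intel compilers provide an optimization report option (-qopt-report) for this purpose.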

Slides: coming soon…

Session 04 (Mar 16, 2017)
Multi-threading with OpenMP

Recording will be posted here after the webinar

Crash course on thread parallelism and the OpenMP framework. We will talk about using threads to utilize multiple processor cores, coordination of thread and data parallelism, and using OpenMP to create threads and team them up to process loops and trees of tasks. You will learn to control the number of threads, variable sharing, and loop scheduling modes, to use mutexes to prevent race conditions, and to implement parallel reduction. The hands-on part demonstrates using OpenMP to parallelize a serial computation, illustrating for-loops, variable sharing, mutexes and parallel reduction on an example application performing numerical integration.
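
To give a flavor of parallel reduction applied to numerical integration, here is a minimal sketch in C. It is an assumption-laden illustration, not the course's lab code: the function name integrate_pi and the choice of integrand are ours. Each thread accumulates a private partial sum, and OpenMP combines the partial sums when the loop ends.

```c
#include <assert.h>

/* Hypothetical example: midpoint-rule estimate of the integral of
   4 / (1 + x^2) over [0, 1], whose exact value is pi. */
double integrate_pi(long n)
{
    assert(n > 0);
    const double dx = 1.0 / (double)n;
    double sum = 0.0;

    /* Each thread works on a private copy of sum; OpenMP adds the
       copies together after the loop, so no mutex is needed. */
    #pragma omp parallel for reduction(+ : sum)
    for (long i = 0; i < n; i++) {
        const double x = ((double)i + 0.5) * dx;
        sum += 4.0 / (1.0 + x * x);
    }
    return sum * dx;
}
```

The code gives the same answer, up to rounding, whether or not OpenMP is enabled (-qopenmp with Intel compilers, -fopenmp with GCC), because the pragma is ignored in a serial build.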

Slides: coming soon…

Session 05 (Mar 17, 2017)
Distributed Computing, MPI

Recording will be posted here after the webinar

The last session of the week is an introduction to distributed computing with the Message Passing Interface (MPI) framework. We illustrate the usage of MPI on standard processors as well as Intel Xeon Phi processors and heterogeneous systems with coprocessors. Inter-operation between Intel MPI and OpenMP is illustrated.

Slides: coming soon…

Session 06 (Mar 20, 2017)
Optimization Overview: N-body

Recording will be posted here after the webinar

Here we begin the discussion of performance optimization. This episode lays out the optimization roadmap that classifies optimization techniques into five categories. The lecture part demonstrates the application of some of the techniques from these five categories to an example application implementing the direct N-body simulation. The hands-on part of the episode demonstrates the optimization process for the N-body simulation and measures performance gains obtained on an Intel Xeon E5 processor and their scalability to an Intel Xeon Phi coprocessor (KNC) and processor (KNL).

Slides: coming soon…

Session 07 (Mar 21, 2017)
Scalar Tuning, Vectorization

Recording will be posted here after the webinar

We start going into the details of performance tuning. The session will cover essential compiler arguments, control of precision and accuracy, unit-stride memory access, data alignment, padding and compiler hints, and the usage of strip-mining for vectorization. The hands-on part will illustrate these techniques in the implementation of LU decomposition of small matrices. We will begin demonstrating strip-mining and AVX-512CD in the computation of binning.
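
Strip-mining can be sketched as follows. This is a minimal illustration in C under our own assumptions: the function name scale_strip_mined and the strip width STRIP are hypothetical and do not come from the course labs. A single loop is split into an outer loop over strips and an inner loop with a short, fixed trip count, plus a remainder loop.

```c
#include <assert.h>
#include <stddef.h>

#define STRIP 16  /* assumed strip width; tune to the vector length */

/* Hypothetical example: multiply every element of data by a. */
void scale_strip_mined(size_t n, float a, float *data)
{
    assert(data != NULL);
    const size_t n_full = n - n % STRIP;  /* part divisible by STRIP */

    /* Outer loop walks over strips; the inner loop has a fixed short
       trip count that the compiler can vectorize. */
    for (size_t ii = 0; ii < n_full; ii += STRIP)
        for (size_t i = ii; i < ii + STRIP; i++)
            data[i] *= a;

    /* Remainder loop for the last n % STRIP elements. */
    for (size_t i = n_full; i < n; i++)
        data[i] *= a;
}
```

On its own this transformation does not change the result; its value is that the outer loop can later be handed to threads while the inner loop stays vectorized.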

Slides: coming soon…

Session 08 (Mar 22, 2017)
Common Multi-threading Problems

Recording will be posted here after the webinar

This session talks about common problems in the optimization of multi-threaded applications. We revisit OpenMP and the binning example from Session 7 to implement multi-threading in that code. The process takes us to the discussion of race conditions, mutexes, and efficient parallel reduction with thread-private variables. We also encounter false sharing and demonstrate how it can be eliminated. The second example discussed in this episode represents stencil operations, and the discussion of this example deals with the problem of insufficient parallelism and demonstrates how to move parallelism from vectors to cores using strip-mining and loop collapse.
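
A minimal sketch of binning with thread-private variables is shown below. It is written under our own assumptions (the function name histogram and the constant N_BINS are hypothetical) and is not the code used in the session. Each thread fills a private histogram on its own stack and merges it into the shared result once, inside a short critical section.

```c
#include <assert.h>
#include <stddef.h>

#define N_BINS 8  /* assumed small, fixed number of bins */

/* Hypothetical example: count how many values fall into each bin.
   Every bin_index[i] must lie in the range [0, N_BINS). */
void histogram(size_t n, const int *bin_index, long counts[N_BINS])
{
    assert(bin_index != NULL && counts != NULL);
    for (int b = 0; b < N_BINS; b++)
        counts[b] = 0;

    #pragma omp parallel
    {
        /* Thread-private histogram: threads never write to neighboring
           elements of a shared array inside the hot loop, which avoids
           both the race condition and false sharing. */
        long local[N_BINS] = {0};

        #pragma omp for nowait
        for (size_t i = 0; i < n; i++)
            local[bin_index[i]]++;

        /* Each thread merges its counts once, under mutual exclusion. */
        #pragma omp critical
        for (int b = 0; b < N_BINS; b++)
            counts[b] += local[b];
    }
}
```

Compared with protecting every increment with a mutex, this pattern pays the synchronization cost once per thread rather than once per data element.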

Slides: coming soon…

Session 09 (Mar 23, 2017)
Multi-Threading, Memory Aspect

Recording will be posted here after the webinar

We continue talking about optimization of multi-threading, this time from the memory access point of view. The material includes principles and methods of affinity control and best practices for NUMA architecture, which includes two-way and four-way Xeon processors and Intel Xeon Phi processors in SNC-2/SNC-4 clustering modes. The hands-on part of the episode demonstrates the tuning of matrix multiplication and array copying on a standard processor as well as an Intel Xeon Phi processor and coprocessor.

Slides: coming soon…

Session 10 (Mar 24, 2017)
Access to Caches and Memory

Recording will be posted here after the webinar

Memory traffic optimization: how to optimize the utilization of caches, and how to optimally access the main memory. We discuss the requirement of data access locality in space and time and demonstrate techniques for achieving it: unit-stride access, loop fusion, loop tiling and cache-oblivious recursion. We also talk about optimizing memory bandwidth, focusing on the MCDRAM in Intel Xeon Phi processors. The hands-on part demonstrates the application of the discussed methods to the matrix-vector multiplication code.
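
Loop tiling for matrix-vector multiplication can be sketched as follows. This is an illustration under our own assumptions, not the session's code: the function name matvec_tiled and the tile width TILE are hypothetical, and a good tile size depends on the cache sizes of the target processor.

```c
#include <assert.h>
#include <stddef.h>

#define TILE 256  /* assumed tile width in elements; tune to cache size */

/* Hypothetical example: y = A * x for a row-major m-by-n matrix A. */
void matvec_tiled(size_t m, size_t n, const double *A,
                  const double *x, double *y)
{
    assert(A != NULL && x != NULL && y != NULL);
    for (size_t i = 0; i < m; i++)
        y[i] = 0.0;

    /* Outer loop over column tiles: a TILE-long piece of x is reused
       for every row before moving on, so it can stay in cache. */
    for (size_t jj = 0; jj < n; jj += TILE) {
        const size_t j_end = (jj + TILE < n) ? jj + TILE : n;
        for (size_t i = 0; i < m; i++) {
            double sum = 0.0;
            /* Unit-stride access to both A and x in the inner loop. */
            for (size_t j = jj; j < j_end; j++)
                sum += A[i * n + j] * x[j];
            y[i] += sum;
        }
    }
}
```

The untiled version streams the whole vector x through the cache for every row; tiling trades that for repeated updates of y, which is usually cheaper when n is large.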

Slides: coming soon…

System Requirements (IMPORTANT!)

Get Your System Ready


To join the webinars:

  • Windows Vista through Windows 10, Mac OS X 10.6 (Snow Leopard) through Mac OS X 10.11 (El Capitan), Google Chrome OS, or any Linux distribution that supports the Google Chrome browser
  • Any modern browser for Windows and Mac OS X; Only Google Chrome browser for Linux (Firefox will not work!)
  • 1 Mbps or better Internet connection
  • 2 GB or more RAM
  • Speakers or a headset
  • Recommended: two displays – one for slides, the other for chat and remote access

For remote access:

  • Any recent Linux, Mac OS X, or Windows (additional software required for Windows)
  • 200 Kbps or better Internet connection
  • Allowed outgoing SSH traffic (port 22)

Supplementary Materials


The course is based on Colfax’s book “Parallel Programming and Optimization with Intel Xeon Phi Coprocessors”, second edition. All topics discussed in the training are covered in the book. Practical exercises used in the course come with both the paper and the electronic edition. The book is an optional guide for the HOW series training. Members of Colfax Research can get a discount.

Getting Your Questions Answered

At any time, feel free to post your questions at the Colfax Forums. We even have a special forum for HOW series-related discussion.

During the broadcast training sessions, you can ask questions in the chat window shown below. Instructors will respond either immediately, or during a question/answer session.
