

## PROGRAMMING AND OPTIMIZATION FOR INTEL® ARCHITECTURE

Hands-On Workshop (HOW) Series "Deep Dive" Session 1

Colfax International — colfaxresearch.com



### DISCLAIMER

While best efforts have been used in preparing this training, Colfax International makes no representations or warranties of any kind and assumes no liabilities of any kind with respect to the accuracy or completeness of the contents and specifically disclaims any implied warranties of merchantability or fitness of use for a particular purpose. The publisher shall not be held liable or responsible to any person or entity with respect to any loss or incidental or consequential damages caused, or alleged to have been caused, directly or indirectly, by the information or programs contained herein. No warranty may be created or extended by sales representatives or written sales materials.



### **COURSE ROADMAP**

- ▷ Module I. Programming Models
  - 01. Intel Architecture and Modern Code
  - 02. Xeon Phi, Coprocessors, Omni-Path
- ▷ Module II. Expressing Parallelism
  - 03. Automatic vectorization
  - 04. Multi-threading with OpenMP
  - 05. Distributed Computing, MPI
- ▷ Module III. Performance Optimization
  - 06. Optimization Overview: N-body
  - 07. Scalar tuning, Vectorization
  - 08. Common Multi-threading Problems
  - 09. Multi-threading, Memory Aspect
  - 10. Access to Caches and Memory

### **HOW SERIES ONLINE**

### Course page: colfaxresearch.com/how-series

- ▶ Slides
- ⊳ Code
- ▶ Video
- ▶ Chat

### More workshops: colfaxresearch.com/training





### **GET YOUR QUESTIONS ANSWERED: CHAT**



### colfaxresearch.com/how-series



### **GET YOUR QUESTIONS ANSWERED: FORUMS**


Forum categories:

- **Colfax Cluster**: discussion of Colfax Cluster usage policies, troubleshooting.
- **Developer Training, HOW Series**: questions about any of the Colfax trainings. Usage of training servers, experience with specific exercises, inquiries on what's inside, suggestions for future trainings: post them here.
- **Performance Optimization and Parallelism**: discuss with Colfax Research and colleagues any topics related to computational science, parallel programming, performance optimization and code modernization.

### colfaxresearch.com/forum

#### RESOURCES

- All registrants receive an invitation from cluster@colfaxresearch.com
- Queue-based access to Intel Xeon E5, Intel Xeon Phi (KNC and KNL)
- Cluster access lasts for the entire 2 weeks of the workshop





### **COLFAX RESEARCH**



#### https://colfaxresearch.com/



# **§2. MODERN CHALLENGES**

### **AREAS OF APPLICATION**

### **COMPUTING DOMAINS**



### **NEED FOR SPEED**

### **PERFORMANCE OPTIMIZATION OUTCOMES**



# **§3. MODERN ARCHITECTURE**

### **HOW PROCESSORS GET FASTER**

### **40 YEARS OF MICROPROCESSOR DATA**



Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten. New plot and data collected for 2010-2015 by K. Rupp.

Source: karlrupp.net


© Colfax International, 2013–2017

### **POWER WALL**








Liquid nitrogen for CPU overclocking

Microsoft's project Natick: immersed datacenter

youtu.be/WZr0W\_g0dqk

news.microsoft.com/natick

Cooling solutions for high clock speeds are impractical or expensive

### INSTRUCTION-LEVEL PARALLELISM (ILP) WALL: PIPELINING

Pipelining – separate hardware for each stage runs different stages of different instructions at the same time

Pipeline stage of each instruction (rows) in each clock cycle (columns):

| Instruction | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 | Cycle 5 | Cycle 6 | Cycle 7 |
|-------------|---------|---------|---------|---------|---------|---------|---------|
| 1           | FETCH   | DECODE  | EXECUTE | MEMORY  | WRITE   |         |         |
| 2           |         | FETCH   | DECODE  | EXECUTE | MEMORY  | WRITE   |         |
| 3           |         |         | FETCH   | DECODE  | EXECUTE | MEMORY  | WRITE   |
| 4           |         |         |         | FETCH   | DECODE  | EXECUTE | MEMORY  |
| 5           |         |         |         |         | FETCH   | DECODE  | EXECUTE |
| 6           |         |         |         |         |         | FETCH   | DECODE  |

The number of pipeline stages is limited, and instructions in the pipeline may conflict


### INSTRUCTION-LEVEL PARALLELISM (ILP) WALL: SUPERSCALAR EXECUTION

Superscalar Execution – hardware checks for independence of operations and pipelines multiple independent instructions in the same cycle

Pipeline stage of each instruction (rows) in each clock cycle (columns):

| Instruction | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 | Cycle 5 | Cycle 6 | Cycle 7 |
|-------------|---------|---------|---------|---------|---------|---------|---------|
| 1           | FETCH   | DECODE  | EXECUTE | MEMORY  | WRITE   |         |         |
| 2           | FETCH   | DECODE  | EXECUTE | MEMORY  | WRITE   |         |         |
| 3           |         | FETCH   | DECODE  | EXECUTE | MEMORY  | WRITE   |         |
| 4           |         | FETCH   | DECODE  | EXECUTE | MEMORY  | WRITE   |         |
| 5           |         |         | FETCH   | DECODE  | EXECUTE | MEMORY  | WRITE   |
| 6           |         |         | FETCH   | DECODE  | EXECUTE | MEMORY  | WRITE   |

Automatic search for independent instructions requires extra resources


Out-of-order Execution – hardware re-orders instructions in a stream to minimize latencies

| Mode         | Instruction | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 | Cycle 5 | Cycle 6 | Cycle 7 |
|--------------|-------------|---------|---------|---------|---------|---------|---------|---------|
| In-Order     | 1           | FETCH   | DECODE  | EXECUTE | MEMORY  | MEMORY  | MEMORY  | WRITE   |
| In-Order     | 2           |         | FETCH   | DECODE  | STALL   | STALL   | STALL   | EXECUTE |
| In-Order     | 3           |         |         | FETCH   | STALL   | STALL   | STALL   | DECODE  |
| Out-of-Order | 1           | FETCH   | DECODE  | EXECUTE | MEMORY  | MEMORY  | MEMORY  | WRITE   |
| Out-of-Order | 2           |         | FETCH   | DECODE  | EXECUTE | WRITE   |         |         |
| Out-of-Order | 3           |         |         | FETCH   | DECODE  | EXECUTE | WRITE   |         |

CPU performance grows faster than RAM bandwidth, and out-of-order execution (OOE) cannot fill the gap

### PARALLELISM

- ▷ CORES – multiple instructions on multiple data elements (MIMD)
- ▷ VECTORS – single instruction on multiple data elements (SIMD)



### Unbounded growth opportunity, but not automatic


#### HOW PROCESSORS GET FASTER

- ▷ Clock speed has hit the power wall
- ▷ Automatic parallelism has hit the ILP wall
- ▷ Out-of-order execution cannot overcome the memory wall

### The Show Must Go On

### Hardware keeps evolving through parallelism. Software must catch up!

### PARALLEL PROGRAMMING LAYERS




### **INTEL ARCHITECTURE**

### **INTEL COMPUTING PLATFORMS**



#### **Computing Accelerators**

- Intel® VCA (x86)
- Intel® Nervana<sup>™</sup> Platform
- Intel® DLIA<sup>™</sup> (FPGAs)



#### **Network Interconnects**

Intel® Omni-Path<sup>™</sup> Architecture



### INTEL XEON PROCESSORS

- ▶ 1-, 2-, 4-way
- General-purpose
- ▶ Highly parallel (44 cores\*)
- Resource-rich
- Forgiving performance
- ▷ Theor. ~ 1.0 TFLOP/s in DP\*
- ▷ Meas. ~ 154 GB/s bandwidth\*

\* 2-way Intel Xeon processor, Broadwell architecture (2016), top-of-the-line (e.g., E5-2699 V4)



### INTEL XEON PHI PROCESSORS (1ST GEN)

- PCIe add-in card
- Specialized for computing
- ▶ Highly-parallel (61 cores\*)
- Balanced for compute
- Less forgiving
- ▷ Theor. ~ 1.2 TFLOP/s in DP\*
- ▷ Meas. ~ 176 GB/s bandwidth\*

\* Intel Xeon Phi coprocessor, Knights Corner architecture (2012), top-of-the-line (e.g., 7120P)






### INTEL XEON PHI PROCESSORS (2ND GEN)

- Bootable or PCIe add-in card
- Specialized for computing
- ▶ Highly-parallel (72 cores\*)
- Balanced for compute
- Less forgiving than Xeon
- ▶ Theor. ~ 3.0 TFLOP/s in DP\*
- ▷ Meas. ~ 490 GB/s bandwidth\*

\* Intel Xeon Phi processor, Knights Landing architecture (2016), top-of-the-line (e.g., 7290P)





### FORM-FACTORS AND MEMORY ORGANIZATION

### INTEL XEON CPU: MEMORY ORGANIZATION

- ▶ Hierarchical cache structure
- ▷ Two-way processors have NUMA architecture






### KNC MEMORY ORGANIZATION

- ▷ Direct access to ≤ 16 GiB of cached GDDR5 memory on board
- ▷ No access to system DDR4, connected to host via PCIe




### **KNL MEMORY ORGANIZATION (BOOTABLE)**

- ▷ On-package high-bandwidth memory (HBM) MCDRAM
- Optimized for arithmetic performance and bandwidth (not latency)





# **§4. MODERN CODE**

### ONE CODE FOR ALL PLATFORMS



### **OPTIMIZATION AND FUTURE-PROOFING**

### **COMPUTING IN SCIENCE AND ENGINEERING**




### **OPTIMIZATION AREAS**




## **ASTROPHYSICAL CODE HEATCODE: AN OFFLOAD STORY**



#### https://colfaxresearch.com/heatcode


## **COMPUTATIONAL FLUID DYNAMICS: LEGACY CODE**



#### https://colfaxresearch.com/shallow-water

## ASIAN OPTION PRICING: HETEROGENEOUS CLUSTERING



https://colfaxresearch.com/heterogeneous

MOTIVATING EXAMPLES


## MACHINE LEARNING: OPTIMIZED MIDDLEWARE

#### INTEL® XEON PHI™ PROCESSORS — MACHINE LEARNING




NEURALTALK2 — OPEN SOURCE IMAGE TAGGING CODE (KARPATHY & FEI-FEI, STANFORD)



https://colfaxresearch.com/isc16-neuraltalk


### WHAT YOU ARE GOING TO LEARN

## **HETEROGENEOUS AND NUMA ARCHITECTURES**



Session 2: handling memory organization in Intel Xeon Phi processors

Vectors – form of SIMD architecture (Single Instruction Multiple Data).

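| Scalar Instructions | Vector Instructions |
|---------------------|---------------------|
| 4 + 1 = 5           | 4 + 1 = 5           |
| 0 + 3 = 3           | 0 + 3 = 3           |
| -2 + 8 = 6          | -2 + 8 = 6          |
| 9 + (-7) = 2        | 9 + (-7) = 2        |

Left: four separate scalar additions. Right: one vector addition applied to all four pairs of elements.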

#### Session 3: automatic vectorization with Intel compilers

#### Cores implement MIMD (Multiple Instruction Multiple Data) architecture





#### Session 4: multi-threading with OpenMP


#### Clusters form distributed-memory systems with network interconnects





#### **Session 5**: Message Passing Interface (MPI)

## **OPTIMIZATION OVERVIEW**



Session 6: optimization overview, case study

### SCALAR TUNING, OPTIMIZATION OF VECTORIZATION

for (i = 0; i < n; i++) A[i] = ...



#### Session 7: precision control, regularizing vectorization patterns


## **COMMON ISSUES IN MULTI-THREADING**



**Session 8**: minimizing synchronization, avoiding false sharing, strip-mining for parallelism

## **MULTI-THREADING, MEMORY ASPECT**



Session 9: thread affinity, NUMA locality, scheduling

### **CACHE AND MEMORY ACCESS**



#### Session 10: loop transformations for locality, bandwidth secrets

# **§5. HANDS-ON DEMONSTRATION**

#### ACCESS THE COLFAX CLUSTER



#### Welcome to Colfax Cluster!




### JOB MANAGEMENT IN THE CLUSTER



## **DOWNLOAD LABS**

```
[u111@c005 ~]% git clone https://github.com/ColfaxResearch/HOW-Series-Labs.git
Cloning into 'HOW-Series-Labs'...
[u111@c005 ~]% ls HOW-Series-Labs/*/
HOW-Series-Labs/2/:
2.01-native-basic  2.03-offload-basic         2.05-shared-virtual-memory-basic
2.02-native-MPI    2.04-offload-asynchronous  2.06-shared-virtual-memory-complex-
HOW-Series-Labs/3/:
3.01-vectorization  3.03-OpenMP-reduction  3.05-Cilk-Plus-basics    3.07-Cilk-Plu
3.02-OpenMP-basics  3.04-OpenMP-tasks      3.06-Cilk-Plus-reducers  3.08-MPI-basi
HOW-Series-Labs/4/:
4.01-overview-nbody                         4.07-threading-affinity
4.02-vectorization-data-structures-coulomb  4.08-memory-tiling-matrix_x_vector
4.03-vectorization-tuning-lu-decomposition  4.09-memory-loop-fusion-statistics
4.04-threading-misc-histogram               4.10-offload-double-buffering-dgem
. . .
```


## **REVIEW AND WHAT'S NEXT**

- ▷ Computers are getting faster through parallelism and specialization
- ▷ Intel Xeon E5 product family: general-purpose parallel processors
- ▷ Intel Xeon Phi product family: specialized parallel processors
- ▷ A coprocessor acts either as an offload device or as an additional compute node

Next session: details of Intel Xeon Phi processor and coprocessor programming.