

# Hotspot-Guided Optimization with Intel VTune Amplifier XE

The Hands-On Workshop (HOW) Series "Tools"

Andrey Vladimirov, PhD — Colfax International colfaxresearch.com

### Disclaimer

While best efforts have been used in preparing this training, Colfax International makes no representations or warranties of any kind and assumes no liabilities of any kind with respect to the accuracy or completeness of the contents and specifically disclaims any implied warranties of merchantability or fitness of use for a particular purpose. The publisher shall not be held liable or responsible to any person or entity with respect to any loss or incidental or consequential damages caused, or alleged to have been caused, directly or indirectly, by the information or programs contained herein. No warranty may be created or extended by sales representatives or written sales materials.

## About the Series

Hands-On Workshop (HOW) "Tools" Series: webinars on efficient programming for Intel architecture with the help of dedicated software development tools.



#### colfaxresearch.com/how-tools-16-06

#### Learn More



\*10x 2-hour sessions | 24-hour 2-weeks remote access to a system | Filling up fast, register now!

#### Interested? Sign-up at:

colfaxresearch.com/how-series


## Get Your Questions Answered

# **Chat** (for this course): colfaxresearch.com/how-tools-16-06


# Forums (general): colfaxresearch.com/discussion


Tap our experts and your peers to help meet the challenge of optimizing applications on modern hardware. This is the place to browse or post questions (and get answers) related to computational science, parallel programming and code modernization on Intel® Architecture.

Welcome aboard. Post questions today!

## **§2.** Intel Architecture

## **Computing Platforms**



## Intel Xeon CPU: Purpose and Specifications

General-purpose platform for demanding computing applications.

- Up to ~ 1 TFLOP/s in DP
- Up to ~ 2 TFLOP/s in SP
- Up to 768 GiB DDR4 RAM
- ~ 126 GB/s bandwidth
- Hardware-rich: forgiving of sub-optimal code





## Intel Xeon Phi Processors (1st Gen)

Specialized platform for demanding computing applications.

- PCIe end-point device
- ~ 1.2 TFLOP/s in DP
- ~ 2.4 TFLOP/s in SP
- Up to 16 GiB GDDR5 RAM
- ~ 176 GB/s bandwidth
- Heterogeneous clustering
- Runs special Linux distribution

## Intel Xeon Phi Processors (2nd Gen)

Specialized platform for demanding computing applications.

- Socket version or coprocessor
- 3+ TFLOP/s in DP
- 6+ TFLOP/s in SP
- Up to 16 GiB MCDRAM
- $\geq$  400 GB/s MCDRAM bandwidth
- Up to 384 GiB DDR4 RAM
- $\geq$  90 GB/s DDR4 bandwidth
- Supports common OS
- Register for webinar





## "Standard Candle" Testbench





| Platform | Year | TDP | RCP |
|----------|------|-----|-----|
| One Intel Xeon Phi 7120P coprocessor | 2012 | 300 W | \$4129 |
| Two Intel Xeon E5-2697 v3 CPUs | 2014 | 290 W | \$5404 |

#### See also "Intel Xeon Product Family: Performance Brief"

# **§3.** Computation and Optimization

## Computing in Science and Engineering



## **Optimization Methodology**

### **Optimization Areas**

- Scalar optimization
- Vectorization
- Multi-threading
- Memory access
- Communication

Details in HOW Series + our book.

### How to Navigate Them

 "Hotspot-guided optimization": iteratively use VTune, scalability tests + optimization report

## Hotspot-Guided Optimization




## Using Intel VTune Amplifier XE

- Compile code with -g -O2 or -g -O3
- Set environment variables or use a wrapper script
- Tweak code input for a short representative run

- The Advanced Hotspots analysis is a good starting point
- Focus on the optimization report for the detected hotspots

## **Clues in Optimization Report**

- Compile code with -qopt-report
- For more verbosity, use up to -qopt-report=5

Things to look for:

- Failed vectorization
- Peeling in short vector loops
- Strided load/store on CPU = gather/scatter on MIC
- Type conversions
- Multiversioning in vectorized loops
- Suggestions for loop permutation
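To make the "strided load/store" item concrete, here is a minimal sketch of the same reduction over two data layouts. The type and function names (`ParticleAoS`, `ParticlesSoA`, `sum_x_aos`, `sum_x_soa`, `to_soa`) are hypothetical and not taken from HEATCODE. With the array-of-structures layout the loop touches every third float, which a vectorizing compiler would typically report as a strided access; with the structure-of-arrays layout the same loop is unit-stride.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array-of-structures: consecutive x values are 3 floats apart in memory,
// so a vectorized loop over x needs strided (gather-like) loads.
struct ParticleAoS { float x, y, z; };

float sum_x_aos(const std::vector<ParticleAoS>& p) {
    float s = 0.0f;
    for (std::size_t i = 0; i < p.size(); ++i) s += p[i].x;  // stride: 3 floats
    return s;
}

// Structure-of-arrays: all x values are contiguous (unit stride).
struct ParticlesSoA { std::vector<float> x, y, z; };

float sum_x_soa(const ParticlesSoA& p) {
    // The three coordinate arrays must describe the same particles.
    assert(p.x.size() == p.y.size() && p.x.size() == p.z.size());
    float s = 0.0f;
    for (std::size_t i = 0; i < p.x.size(); ++i) s += p.x[i];  // stride: 1 float
    return s;
}

// One-time layout conversion from AoS to SoA.
ParticlesSoA to_soa(const std::vector<ParticleAoS>& in) {
    ParticlesSoA out;
    out.x.reserve(in.size());
    out.y.reserve(in.size());
    out.z.reserve(in.size());
    for (const ParticleAoS& p : in) {
        out.x.push_back(p.x);
        out.y.push_back(p.y);
        out.z.push_back(p.z);
    }
    return out;
}
```

Whether the conversion pays off depends on how often the data is traversed; the optimization report for the hot loop is the place to check which access pattern the compiler actually generated.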

## Use VTune to detect the most important locations

## **Optimization Report Example**

    LOOP BEGIN at TransientHeatingFunctionsXeonPhi.cc(162,4)
       remark #15389: vectorization support: reference bMatrix has unaligned access
       remark #15389: vectorization support: reference _M_data has unaligned access
       remark #15381: vectorization support: unaligned access used inside loop body
       remark #15305: vectorization support: vector length 8
       remark #15399: vectorization support: unroll factor set to 2
       remark #15309: vectorization support: normalized vectorization overhead 1.118
       remark #15417: vectorization support: number of FP up converts: single prec..
       remark #15300: LOOP WAS VECTORIZED
       remark #15442: entire loop may be executed in remainder
       remark #15450: unmasked unaligned unit stride loads: 2
       remark #15475: --- begin vector loop cost summary ---
       remark #15476: scalar loop cost: 10
       remark #15477: vector loop cost: 2.120
       remark #15478: estimated potential speedup: 3.890
       remark #15487: type converts: 1
       remark #15488: --- end vector loop cost summary ---
    LOOP END

# §4. Case Study: HEATCODE

#### Case Study: Building a 3D Model of the Milky Way Galaxy using 2D Sky Surveys




**Goal:** build a 3D model of the Milky Way Galaxy using a large volume of 2D data from sky surveys.

**Method:** Bayesian inference. Simulate the Galaxy, assess the fit to data, refine 3D model parameters, rinse & repeat.

**Challenge:** modeling the process of stochastic heating of cosmic dust by starlight, in each voxel of a 3D grid, is very time consuming. With unoptimized code, **hundreds of CPU-years** for each run.

Figure: one possible realization of a 3D model of the Milky Way Galaxy (cosmic dust luminosity map calculated by the FRaNKIE code). The position of the Sun is marked; the scale bar is 40 kiloparsecs.

#### Software Stack for Modeling Galactic 3D Structure



- **MultiNest**: Bayesian analysis engine. Scales to O(10) nodes.
- **FRaNKIE**: radiation transport Monte Carlo. Scales to multiple cores in 1 node.
- **HEATCODE**: cosmic dust heating library. Multiple Xeon Phi coprocessors in 1 node.


© Colfax International, 2013-2016

#### **Accelerating Radiation Transport Models** for the Milky Way

**Solution:** use a computing accelerator, optimize existing code.

**Result:** HEATCODE (HEterogeneous Architecture library for sTochastic COsmic Dust Emissivity), open source, code soon to be published.

Paper: Troy A. Porter<sup>1</sup>, Andrey E. Vladimirov<sup>1,2</sup>, "Calculation of Stochastic Heating and Emissivity of Cosmic Dust Grains with Optimization for the Intel Many-Core Architecture", arxiv.org/abs/1311.4627

<sup>1</sup> Hansen Experimental Physics Laboratory, Stanford University, 452 Lomita Mall, Stanford, CA 94305-4085, USA


Hundreds of CPU-years → hundreds of CPU-days

#### **Stochastic Dust Grain Heating**

- Small grains (≤0.1 µm) are important
- Absorption and re-emission is stochastic (non-thermal)
- Grains undergo "temperature" spikes, characterized by temperature distribution
- Evaluation is computationally expensive



#### **Calculation of Stochastic Dust Emissivity**

- **Input**: incident electromagnetic radiation field
- **Intermediate**: "temperature" distribution of grains of all sizes
- **Output**: spectrum of re-emitted photons
- Method and absorption cross sections: Draine et al. (2001), ApJ, 551, 807



#### **Matrix Formalism for Stochastic Dust Emissivity**

- Stage 1: interpolate (in log space) and convolve the incident RF with the photon absorption cross sections:

$$T_{ul} = I(\lambda)\sigma(\lambda)\frac{\lambda^3 \Delta E_{ul}}{hc^2} \quad \text{for } u > l,$$

$$I(\lambda)\sigma(\lambda) \equiv \Omega(\lambda),$$

$$\log\left[\frac{\Omega(\lambda)}{\Omega(\lambda_{j-1})}\right] = \frac{\log(\lambda/\lambda_{j-1})}{\log(\lambda_j/\lambda_{j-1})}\log\left[\frac{\Omega(\lambda_j)}{\Omega(\lambda_{j-1})}\right]$$

- Stage 2: form and solve a quasi-triangular system of linear algebraic equations for the "temperature" distribution:

$$X_f = \frac{1}{T_{(f-1)f}}\sum_{j=0}^{f-1} B_{fj}X_j$$

- Stage 3: convolve the "temperature" distribution with the grain size distribution and emissivity function:

$$\nu F_a(\nu) = \sigma(\nu)\sum_{i=0}^M P_i(a)\Lambda(\nu, E_i)$$

$$\Lambda(\nu, E_i) = \begin{cases} 0, & \text{if } E_i < h\nu, \\ \dfrac{2h\nu^4}{c^2}\,\dfrac{1}{\exp(h\nu/kT_i) - 1}, & \text{otherwise} \end{cases}$$
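The log-space interpolation of Stage 1 can be sketched in a few lines. This is only an illustration of the interpolation formula for $\Omega(\lambda)$ between two grid points; the function name `loglog_interp` and its argument order are assumptions made for this example, not part of HEATCODE.

```cpp
#include <cassert>
#include <cmath>

// Interpolate Omega at wavelength lam between the grid points
// (lam0, om0) and (lam1, om1), linearly in log-log space:
//   log(Omega/om0) = log(lam/lam0) / log(lam1/lam0) * log(om1/om0)
double loglog_interp(double lam, double lam0, double om0,
                     double lam1, double om1) {
    // Logarithms require strictly positive arguments and distinct nodes.
    assert(lam > 0.0 && lam0 > 0.0 && lam1 > lam0);
    assert(om0 > 0.0 && om1 > 0.0);
    const double w = std::log(lam / lam0) / std::log(lam1 / lam0);
    return om0 * std::exp(w * std::log(om1 / om0));
}
```

A power law is reproduced exactly by this scheme up to rounding: with nodes (1, 1) and (4, 16), i.e. $\Omega = \lambda^2$, the value interpolated at $\lambda = 2$ is 4.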

# **§5.** Practical Optimization


## Focus Point 1: Algorithm

- Before: 3 nested loops, $O(N^3)$ complexity
- After: 2 nested loops, $O(N^2)$ complexity
- Used a recurrence relation to improve the algorithm

Speedup in HEATCODE: 3.7x
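As an illustration of how a recurrence removes one loop level, consider a matrix of column-wise cumulative sums, $B_{fj} = \sum_{k \ge f} T_{kj}$. The slides do not spell out the recurrence used in HEATCODE, so this particular definition is an assumption chosen for the example, and the names `cumulative_naive` and `cumulative_recurrent` are hypothetical. Recomputing every sum from scratch takes three nested loops; the relation $B_{fj} = B_{f+1,j} + T_{fj}$ needs only two.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// O(N^3): every B[f][j] is summed from scratch over k = f .. n-1.
Matrix cumulative_naive(const Matrix& T) {
    const std::size_t n = T.size();
    for (const auto& row : T) assert(row.size() == n);  // square input only
    Matrix B(n, std::vector<double>(n, 0.0));
    for (std::size_t f = 0; f < n; ++f)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t k = f; k < n; ++k)
                B[f][j] += T[k][j];
    return B;
}

// O(N^2): reuse the row below via B[f][j] = B[f+1][j] + T[f][j],
// filling rows from the last one upward.
Matrix cumulative_recurrent(const Matrix& T) {
    const std::size_t n = T.size();
    for (const auto& row : T) assert(row.size() == n);  // square input only
    Matrix B(n, std::vector<double>(n, 0.0));
    for (std::size_t f = n; f-- > 0;)
        for (std::size_t j = 0; j < n; ++j)
            B[f][j] = T[f][j] + (f + 1 < n ? B[f + 1][j] : 0.0);
    return B;
}
```

Both functions return the same matrix for exactly representable inputs; for general floating-point data the results can differ in the last bits because the order of additions changes.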

## Focus Point 2: Thread Scalability

- Minimize synchronization
- Tune scheduling for load balancing
- Expose all of the parallelism
- Make threading co-exist with vectorization

Speedup in HEATCODE: 1.2x (total: 4.4x)

#### More details in the HOW series Session 4 and Session 7
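The thread scalability points above can be sketched with OpenMP on a deliberately imbalanced loop. This is an illustration only: the function `triangular_sum` and the chunk size of 4 are assumptions for the example, not HEATCODE code. A reduction replaces explicit synchronization, dynamic scheduling addresses the load imbalance, and the inner loop is left to the vectorizer so that threading and vectorization coexist. Without OpenMP enabled at compile time the pragmas are ignored and the code runs serially with the same result.

```cpp
#include <cassert>

// Sum of all prefix sums of x[0..n-1]. Row i costs i+1 inner iterations,
// so equal-sized static chunks of rows would be poorly balanced.
double triangular_sum(const double* x, long n) {
    assert(x != nullptr && n >= 0);
    double total = 0.0;
    // reduction: each thread keeps a private partial sum (no locks);
    // schedule(dynamic, 4): idle threads grab the next 4 rows.
#pragma omp parallel for schedule(dynamic, 4) reduction(+ : total)
    for (long i = 0; i < n; ++i) {
        double row = 0.0;
        // Threads work across rows, vector lanes work within a row.
#pragma omp simd reduction(+ : row)
        for (long j = 0; j <= i; ++j) row += x[j];
        total += row;
    }
    return total;
}
```

The best schedule and chunk size depend on the workload, so they are worth checking with a scalability test rather than fixing in advance.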

## **HEATCODE Scalability Tuning**



- **Precompute + look up** vs **compute on the fly** tradeoff
- Rule of thumb: 1 memory access costs about as much as 100 FLOPs
- Precomputation helped in HEATCODE

Speedup in HEATCODE: 4.3x (total: 19x)
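A minimal sketch of the precompute-and-look-up tradeoff, assuming a hypothetical double sum in which an expensive factor `exp(-e[i])` depends only on the inner index. The function names are invented for this example. The first version evaluates `exp` for every pair of indices; the second evaluates it once per `i` and reads it back from a small table.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Compute on the fly: exp() is called g.size() * e.size() times.
double weighted_sum_on_the_fly(const std::vector<double>& e,
                               const std::vector<double>& g) {
    assert(!e.empty() && !g.empty());  // empty input is treated as a caller bug
    double s = 0.0;
    for (double gj : g)
        for (double ei : e)
            s += gj * std::exp(-ei);
    return s;
}

// Precompute + look up: exp() is called e.size() times, then reused.
double weighted_sum_lookup(const std::vector<double>& e,
                           const std::vector<double>& g) {
    assert(!e.empty() && !g.empty());  // empty input is treated as a caller bug
    std::vector<double> table(e.size());
    for (std::size_t i = 0; i < e.size(); ++i) table[i] = std::exp(-e[i]);
    double s = 0.0;
    for (double gj : g)
        for (double t : table)
            s += gj * t;
    return s;
}
```

The table wins here because it is small and reused many times. When a table grows beyond cache, or the function is cheap, the balance can tip back toward computing on the fly, which is what the rule of thumb above is meant to capture.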

## Fruit 2: Scalar Tuning

- All double precision = OK
- All single precision = GREAT
- Mixed precision = BAD

Speedup in HEATCODE: 1.8x (total: 34x)

More details in the HOW series Session 6.
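The mixed precision case often comes from a single literal. In the sketch below (function names are hypothetical), `0.5` is a `double` constant, so every `float` element is converted up to double precision, multiplied, and converted back; writing `0.5f` keeps the whole loop in single precision. The two `static_assert` lines check the resulting expression types at compile time.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <vector>

// double * float is evaluated in double; float * float stays in float.
static_assert(std::is_same<decltype(0.5 * 1.0f), double>::value,
              "a double literal promotes the product to double");
static_assert(std::is_same<decltype(0.5f * 1.0f), float>::value,
              "a float literal keeps the product in float");

// Mixed precision: float -> double, multiply, double -> float per element.
void scale_mixed(const std::vector<float>& in, std::vector<float>& out) {
    assert(in.size() == out.size());
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i] = static_cast<float>(0.5 * in[i]);
}

// Consistent single precision: no type conversions in the loop body.
void scale_single(const std::vector<float>& in, std::vector<float>& out) {
    assert(in.size() == out.size());
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i] = 0.5f * in[i];
}
```

In an optimization report, the first version is the kind of loop that would be expected to show type conversion remarks, while the second should not.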

- Loop tiling: cache blocking or unroll-and-jam
- Cache-oblivious parallel recursion
- Loop fusion

Speedup in HEATCODE: 2.4x (total: 82x)

More details in the HOW series Session 9.
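Loop tiling (cache blocking) can be illustrated on a matrix transpose, a textbook case rather than the loop that was tiled in HEATCODE. The function name `transpose_tiled` and the default tile size of 32 are assumptions for this sketch. The two outer loops walk over tiles; the two inner loops stay within one tile, so both the rows being read and the columns being written remain in cache while they are in use.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// out = transpose(in) for an n x n row-major matrix, processed tile by tile.
void transpose_tiled(const std::vector<double>& in, std::vector<double>& out,
                     std::size_t n, std::size_t tile = 32) {
    assert(in.size() == n * n && out.size() == n * n && tile > 0);
    for (std::size_t ii = 0; ii < n; ii += tile) {
        for (std::size_t jj = 0; jj < n; jj += tile) {
            // Clamp the tile at the matrix edge when n is not a multiple of tile.
            const std::size_t i_end = std::min(ii + tile, n);
            const std::size_t j_end = std::min(jj + tile, n);
            for (std::size_t i = ii; i < i_end; ++i)
                for (std::size_t j = jj; j < j_end; ++j)
                    out[j * n + i] = in[i * n + j];
        }
    }
}
```

The tile size is a tuning parameter: it should be small enough for a tile of input and output to fit in the targeted cache level, and is best chosen by measurement.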

- Data alignment
- Alignment hint
- Guiding the compiler with #pragma simd

Speedup in HEATCODE: 1.1x (total: 91x)

More details in the HOW series Session 6.
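The three vectorization items above can be combined in one small sketch. Two substitutions are made here and should be read as assumptions: the OpenMP directive `#pragma omp simd` with its `aligned` clause is used as a portable stand-in for the Intel-specific `#pragma simd` and alignment hints, and the function `saxpy_aligned` is a generic example rather than HEATCODE code. The caller is responsible for allocating the arrays on a 64-byte boundary, for instance with `alignas(64)`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// a[i] += alpha * b[i]; both arrays must start on a 64-byte boundary.
void saxpy_aligned(float* a, const float* b, std::size_t n, float alpha) {
    // Data alignment: verify the promise made to the compiler below.
    assert(reinterpret_cast<std::uintptr_t>(a) % 64 == 0);
    assert(reinterpret_cast<std::uintptr_t>(b) % 64 == 0);
    // Alignment hint plus a request to vectorize this loop.
#pragma omp simd aligned(a, b : 64)
    for (std::size_t i = 0; i < n; ++i)
        a[i] += alpha * b[i];
}
```

A caller would declare its buffers as, for example, `alignas(64) float a[1024];`. If the alignment promise is false, the behavior of the vectorized loop is undefined, which is why the sketch checks it with assertions in debug builds.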

# §6. Summary

## **Performance Results**



## Illustration: youtu.be

Paper: http://xeonphi.com/papers/heatcode


## **Optimization Methodology**

### **Optimization Areas**

- Scalar optimization
- Vectorization
- Multi-threading
- Memory access
- Communication

Details in HOW Series + our book.

### How to Navigate Them

 "Hotspot-guided optimization": iteratively use VTune, scalability tests + optimization report

## Alternative Optimization Methodology



#### Source: Parallel Universe Magazine, issue 23

## **§7.** Resources

## Supplementary Materials: Textbook

ISBN: 978-0-9885234-0-1 (2nd edition, 508 pages, Electronic or Print)

Parallel Programming and Optimization with Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessors

Handbook on the Development and Optimization of Parallel Applications for Intel<sup>®</sup> Xeon<sup>®</sup> Processors and Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessors

© Colfax International, 2015



### **Colfax Research**



http://colfaxresearch.com/


## Developer's Guide to Knights Landing



colfaxresearch.com/knl-webinar/

## Slides, Code, Video

You can download slides, code and watch the video recording of this webinar here (requires registration for a free Colfax Research account):

## colfaxresearch.com/how-tools-16-06

Next webinar on June 15, 2016: "Practical usage of Intel Math Kernel Library: performance tuning tips and usage with coprocessors":

Register