#### Accelerated Simulations of Cosmic Dust Heating Using the Intel Many Integrated Core Architecture

Andrey Vladimirov (avladim@stanford.edu), Hansen Experimental Physics Laboratory (HEPL), Stanford University. Work with: Troy Porter, HEPL, Stanford

Talk given at UC Santa Cruz, Applied Mathematics and Statistics, June 6, 2013

### Interstellar Radiation Field (ISRF)

- Most energy in the infrared (IR), optical and ultraviolet (UV) ranges.
- Sources: stars, dust.
- Processes: elastic scattering, absorption by dust, re-emission
- Precise ISRF modeling is necessary for the studies of:
  - The interstellar medium (ISM)
  - Cosmic ray (CR) propagation
  - Extragalactic background light (EBL)
  - Dust-obscured objects (for background elimination)



### FRaNKIE Code

- "Fast RAdiative transfer Numerical Kode for Interstellar Emission"
- Monte Carlo transport of photons (broadband) in the Galaxy
- Physical input: distribution of stars and dust, microscopic processes
- Output: 3D ISRF density and sky maps, both with wavelength dependence
- References: Porter et al. (2008), ApJ, 682, 400; Ackermann et al. (2012), ApJ, 750, 3
- Results used in the GALPROP code for cosmic ray transport



#### **Interstellar Dust**

- Conglomerates of atoms (H, C, O, Si, etc.) and molecules, 0.5–1000 nm = 10<sup>1</sup>–10<sup>10</sup> particles
- Absorbs and scatters optical & UV light
- Optically heated dust re-emits energy in IR
- Important for H<sub>2</sub> production, star formation



Figure: SEM images of interplanetary dust, Jessberger et al. (2001)

Figure: size distribution of interstellar dust, Weingartner & Draine (2001)

#### **Interstellar Dust**

- Optical depth toward Galactic center >> 1 (significant attenuation, wavelength-dependent)
- Must be included in radiative transport (RT) simulations
- Dust heating and RT must be treated self-consistently (local dust heating by propagated photons + propagation of re-emitted IR photons)

[Figure: panels labeled infrared, mid-infrared, near infrared and optical; scale label: 40 kiloparsecs]

#### **Stochastic Dust Grain Heating**

- Large grains are heated by photon absorption attaining thermal equilibrium with the heating radiation field (characterizable by a single temperature, easy to model)
- For very small grains (≤0.1 µm), absorption and re-emission are stochastic: grains undergo "temperature" spikes and are characterized by a temperature distribution, which is computationally expensive to evaluate

[Figure: vibrational energy ("grain temperature") levels, showing heating transitions (UV absorption) and IR emission]

### Matrix Formalism for Stochastic Dust Emissivity Calculation

- Input data: incident electromagnetic radiation field
- Intermediate: "temperature" distribution of grains of all sizes
- Output: spectrum of re-emitted photons
- Carbonaceous, silicate and PAH grains (polycyclic aromatic hydrocarbons)
- Method and absorption cross sections: Draine et al. (2001), ApJ, 551, 807



#### Matrix Formalism for Stochastic Dust Emissivity Calculation

• Stage 1 (transcendental operations): interpolate (in log space) and convolve the incident RF with the photon absorption cross sections

$$T_{ul} = I(\lambda)\sigma(\lambda)\frac{\lambda^{3}\Delta E_{ul}}{hc^{2}} \quad \text{for} \quad u > l.$$
$$I(\lambda)\sigma(\lambda) \equiv \Omega(\lambda)$$
$$\log\left[\frac{\Omega(\lambda)}{\Omega(\lambda_{j-1})}\right] = \frac{\log\left(\lambda/\lambda_{j-1}\right)}{\log\left(\lambda_{j}/\lambda_{j-1}\right)}\log\left[\frac{\Omega(\lambda_{j})}{\Omega(\lambda_{j-1})}\right]$$

• Stage 2 (sparse memory access): form and solve a quasitriangular system of linear algebraic equations for the "temperature" distribution

$$\sum_{j \neq i} T_{ij} P_j - \sum_{j \neq i} T_{ji} P_i = 0$$
$$T_{ij} = 0, \quad \text{if} \quad i < j - 1$$
$$B_{fj} = \sum_{k=f}^M T_{kj} \quad (f > j)$$
$$X_f = \frac{1}{T_{(f-1)f}} \sum_{j=0}^{f-1} B_{fj} X_j$$

• Stage 3 (dense linear algebra): convolve the "temperature" distribution with the grain size distribution and emissivity function

$$\nu F_a(\nu) = \sigma(\nu) \sum_{i=0}^{M} P_i(a) \Lambda(\nu, E_i)$$
$$\Lambda(\nu, E_i) = \begin{cases} 0, & \text{if } E_i < h\nu, \\ \dfrac{2h\nu^4}{c^2} \dfrac{P_i}{\exp(h\nu/kT_i) - 1}, & \text{otherwise} \end{cases}$$
$$\nu F(\nu) = \int_{a_{\min}}^{a_{\max}} \nu F_a(\nu) Q(a)\, da$$
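The recurrence for the "temperature" distribution can be sketched in a few lines. The code below is a minimal, unoptimized illustration and not the HEATCODE implementation: the function names, the choice X_0 = 1 and the final normalization P_i = X_i / Σ X_k are assumptions made for this example, and B_fj is recomputed naively rather than recurrently.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

typedef std::vector<std::vector<double> > Matrix;

// Steady-state populations for a transition matrix T, where T[i][j] is the
// rate from level j to level i and T[i][j] == 0 for i < j - 1 (cooling only
// reaches the adjacent lower level).
std::vector<double> solve_populations(const Matrix& T) {
    const int n = static_cast<int>(T.size());      // n = M + 1 levels
    assert(n > 0);
    std::vector<double> X(n, 0.0);
    X[0] = 1.0;                                     // assumed scale, removed by normalization
    for (int f = 1; f < n; ++f) {
        double upward = 0.0;
        for (int j = 0; j < f; ++j) {
            double B_fj = 0.0;                      // B_fj = sum_{k=f}^{M} T_kj
            for (int k = f; k < n; ++k) B_fj += T[k][j];
            upward += B_fj * X[j];
        }
        X[f] = upward / T[f - 1][f];                // X_f = (1/T_{(f-1)f}) sum_j B_fj X_j
    }
    double norm = 0.0;
    for (int i = 0; i < n; ++i) norm += X[i];
    for (int i = 0; i < n; ++i) X[i] /= norm;       // P_i = X_i / sum_k X_k
    return X;
}

// Largest |sum_{j != i} (T_ij P_j - T_ji P_i)| over all levels i.
double max_residual(const Matrix& T, const std::vector<double>& P) {
    const int n = static_cast<int>(T.size());
    double worst = 0.0;
    for (int i = 0; i < n; ++i) {
        double r = 0.0;
        for (int j = 0; j < n; ++j)
            if (j != i) r += T[i][j] * P[j] - T[j][i] * P[i];
        worst = std::fmax(worst, std::fabs(r));
    }
    return worst;
}
```

Calling `solve_populations` on a small test matrix and then checking that `max_residual` is near zero confirms that the recurrence satisfies the balance equation it was derived from.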

### **Computational Challenge**

- <u>What</u>: Constrain the geometrical/compositional parameters of the distribution of light sources (stars) in the Galaxy.
   <u>How</u>: Bayesian analysis of sky survey data. The analysis fits the results of the FRaNKIE code within a parameter space to the observational data.
- Need of order 10<sup>5</sup> FRaNKIE evaluations with 10<sup>5</sup> cells each
- <u>Difficulty</u>: stochastic dust heating must be computed for every simulation cell in the Galaxy consistently with the RF.
- Bottlenecked by stochastic emissivity calculation: 60 ms per cell on a modern 16-core Intel Xeon E5 server.
- This translates to about 20 machine-years for the whole calculation: too much.
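As a rough check of that estimate, assuming one emissivity calculation per cell per FRaNKIE evaluation (an assumption; the slide does not state the number of iterations per evaluation):

$$10^{5}\ \text{evaluations} \times 10^{5}\ \text{cells} \times 0.06\ \text{s} = 6 \times 10^{8}\ \text{s} \approx 19\ \text{years}$$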

# HEATCODE (HEterogeneous Architecture library for sTochastic COsmic Dust Emissivity)

#### What we need:

- Optimize the stochastic emissivity calculation to employ available compute resources more efficiently
- Use high performance computing accelerators if available
- Operate on a wide range of computing platforms for public distribution

#### Our solution:

- A new library called HEATCODE (started from unoptimized implementation of the Draine et al. matrix formalism)
- Optimized for Intel Xeon multi-core architecture, suitable for any CPU
- After optimization, 100x more efficient on the same hardware (2x Intel Xeon E5-2680 CPUs)
- Additional 4.5x with two Intel Xeon Phi 5110P coprocessors
- Support for GPGPUs can easily be added

#### **Preview of Results**



### Intel MIC Architecture



- Runs its own Linux OS
- Hosts a virtual file system
- IP-addressable

Intel Xeon Phi coprocessor: to accelerate applications that have reached the parallel scaling limits of Intel Xeon processors

- PCIe v2.0 device
- Nominal: 245-300 W, ~1 TFLOP/s in double precision, 354 GB/s bandwidth
- 60 dual-issue in-order cores @ 1 GHz with 4-way hyper-threading (240 logical cores)
- 8 GB onboard GDDR5
- 512-bit SIMD instructions
- The same languages (C, C++, Fortran), tools (compilers, profilers) and optimization methods as general-purpose Intel CPUs

#### **Programming Models for the MIC Architecture**



### **Explicit Offload Model**

- HEATCODE uses the explicit offload model
- Simplicity, compatibility with the CUDA approach, fall-back to host

```
#pragma offload_attribute(push, target(mic))
void InterpolateWeightedRF(const int wlBins, float* RF, ...) {
    /* ... one implementation for both the host and the MIC */
}
#pragma offload_attribute(pop)

void CalculateTransientEmissivity( ... ) {
    RF = (float*)malloc(wlBins*nSpectra*sizeof(float));
#pragma offload target(mic) inout(RF: length(wlBins*nSpectra))
    {   /* run on the coprocessor if available, otherwise on host */
        InterpolateWeightedRF(wlBins, RF, ...);
        ...
    }
}
```

#### **Threading Models for the MIC Architecture**



### Threading Optimization for the MIC Architecture

- Qualitatively, MIC requires the same optimizations as multi-core CPUs:
  - Avoid synchronization
  - Eliminate false sharing
  - Choose optimal scheduling mode
  - Avoid serial operations
- Quantitatively, more parallelism: MIC applications must scale to 240 threads
  - Increase the number of parallel tasks to keep all threads busy
  - Reduce per-thread memory overhead if problem does not fit in memory
  - Set appropriate thread affinity

#### HOW TO ACHIEVE:

- Use reduction instead of mutexes
- Padding, thread-private containers
- Balance load against scheduling overhead
- Parallelize algorithm, use static memory allocation: malloc() is serial
- Collapse nested loops or rethink parallelization strategy
- Change algorithm or use nested parallelism within tasks
- In OpenMP, use KMP\_AFFINITY
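The first item, a reduction in place of a mutex, can be illustrated with a short OpenMP sketch. The function below is a hypothetical example and is not taken from HEATCODE. Each thread accumulates into its own private copy of `sum`, and the copies are combined once at the end of the loop, so nothing is locked inside the loop body.

```cpp
#include <cassert>
#include <vector>

// Sum of squares with an OpenMP reduction. Without OpenMP enabled the pragma
// is ignored and the loop runs serially with the same result.
double sum_of_squares(const std::vector<double>& v) {
    double sum = 0.0;
    const long n = static_cast<long>(v.size());
#pragma omp parallel for reduction(+ : sum) schedule(static)
    for (long i = 0; i < n; ++i)
        sum += v[i] * v[i];
    return sum;
}
```

With the Intel or GNU compilers, OpenMP is enabled by a compiler flag (for example `-fopenmp` with GCC), and thread placement can then be controlled through `KMP_AFFINITY` as noted above.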

### **Threading Issues in HEATCODE**

- Unoptimized code used scratch space liberally. This is OK on host, but a limitation on MIC.
- Before optimization, we had to reduce the number of threads on the MIC to fit in the 8 GB of onboard memory
- Optimization: reducing per-thread memory overhead by loop fusion



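[Code figure: `RadiationFieldToTemperatureDistribution()` before and after optimization. Recoverable fragments of the first panel: `CalculateMatrices`, `InterpolateWeightedRadiationField`, the `weightedRadiationField` array, and the loop `for (int i = 0; i < gIMax; i++)`. Second panel title: "Optimized (loop fusion)".]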
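The effect of loop fusion on per-thread scratch space can be shown with a small sketch. The two functions below are hypothetical stand-ins, not the HEATCODE routines: the unfused version stores an intermediate array of `nGrains * nBins` floats before reducing it, while the fused version consumes each product as soon as it is computed and needs no scratch array at all.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Unfused: the first loop nest fills a scratch array, the second reads it back.
std::vector<float> weighted_sums_unfused(const std::vector<float>& rf,
                                         const std::vector<float>& sigma,
                                         int nGrains, int nBins) {
    std::vector<float> scratch(static_cast<std::size_t>(nGrains) * nBins);
    for (int g = 0; g < nGrains; ++g)
        for (int b = 0; b < nBins; ++b)
            scratch[g * nBins + b] = rf[b] * sigma[g * nBins + b];
    std::vector<float> total(nGrains, 0.0f);
    for (int g = 0; g < nGrains; ++g)
        for (int b = 0; b < nBins; ++b)
            total[g] += scratch[g * nBins + b];
    return total;
}

// Fused: one loop nest, the intermediate product lives in a register.
std::vector<float> weighted_sums_fused(const std::vector<float>& rf,
                                       const std::vector<float>& sigma,
                                       int nGrains, int nBins) {
    std::vector<float> total(nGrains, 0.0f);
    for (int g = 0; g < nGrains; ++g) {
        float acc = 0.0f;
        for (int b = 0; b < nBins; ++b)
            acc += rf[b] * sigma[g * nBins + b];
        total[g] = acc;
    }
    return total;
}
```

With 240 threads each holding its own scratch array, removing the intermediate buffer reduces the memory footprint by a factor that grows with the thread count, which is why this matters more on the coprocessor than on the host.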

#### Why Care about Vectorization

- In older architectures, SIMD registers were narrow (64- or 128-bit), and scientists often must use double precision (64-bit) => without vectorization, one loses up to 2x, which is often not significant enough to worry about
- On the MIC architecture, vectors are 512 bits wide => without vectorization, one pays an 8x penalty in double precision (16x in single precision)

| Instruction Set | Year and Intel Processor | Vector registers |
|-----------------|--------------------------|------------------|
| MMX             | 1997, Pentium            | 64-bit    |
| SSE             | 1999, Pentium III        | 128-bit   |
| SSE2            | 2001, Pentium 4          | 128-bit   |
| SSE3–SSE4.2     | 2004 - 2009              | 128-bit   |
| AVX             | 2011, Sandy Bridge       | 256-bit   |
| AVX2            | 2013, (future) Haswell   | 256-bit   |
| IMCI            | 2012, Knights Corner     | 512-bit   |

Table credit: "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors", Colfax
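The 2x, 8x and 16x figures quoted for the vectorization penalty are simply the number of elements per vector register. A minimal sketch, assuming that scalar code uses exactly one lane per instruction (so the numbers are upper bounds); `simd_lanes` is a name chosen for this example:

```cpp
#include <cassert>

// Number of elements of a given width that fit in one vector register.
constexpr int simd_lanes(int register_bits, int element_bits) {
    return register_bits / element_bits;
}

static_assert(simd_lanes(128, 64) == 2,  "SSE2, double precision");
static_assert(simd_lanes(256, 64) == 4,  "AVX, double precision");
static_assert(simd_lanes(512, 64) == 8,  "IMCI, double precision");
static_assert(simd_lanes(512, 32) == 16, "IMCI, single precision");
```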

### Optimization for the Intel MIC Architecture: (Automatic) Vectorization

- Intel Xeon E5 (Sandy Bridge) architecture: 256-bit AVX vector instructions, legacy SSE, MMX...
- Intel Xeon Phi (Knights Corner) architecture: 512-bit IMCI instructions
- Xeon Phi (Knights Corner) does not support AVX or SSE instructions
- Explicit SIMD coding is possible, but automatic vectorization is more portable, flexible and efficient

```
/* Fragment of the solution for temperature distribution P_i from B_ij
   in the HEATCODE library with automatic vectorization */
#pragma vector aligned
for (int i = 0; i < tempBins; ++i)
    sum += bMatrix[f*tempBins + i]*x[i];
```

```
avladim@dublin$ # Auto-vectorization report
avladim@dublin$ icpc -c -vec-report3 \
    TransientHeatingFunctionsXeonPhi.cc
TransientHeatingFunctionsXeonPhi.cc(199): (col. 6) remark: LOOP WAS VECTORIZED.
TransientHeatingFunctionsXeonPhi.cc(199): (col. 6) remark: *MIC* LOOP WAS VECTORIZED.
TransientHeatingFunctionsXeonPhi.cc(199): (col. 6) remark: *MIC* REMAINDER LOOP WAS VECTORIZED.
. . .
```

## **Optimizing Vectorization**

- Contiguous memory access works best
- Align arrays on 64-byte boundary
- The compiler may need hints (pragma directives)
- Avoid type conversions
- Index notation better than pointer references
- No need to precompute array indices

```
/* Recurrent calculation of B_ij
   from T_ij in the HEATCODE library.
   Programmer guarantees data alignment, so the compiler does not
   have to implement runtime alignment checks.
   Loop count estimate helps the compiler to pick the optimal
   vectorization strategy. */
#pragma vector aligned
#pragma loop count min(16)
for (int i = 0; i < iMax; ++i) {
    rSum[i] += bMatrix[f*tempBins + i];
    bMatrix[f*tempBins + i] = rSum[i];
}
```

### **Optimizing Vectorization**

- A 512-bit vector holds 16 single-precision FP numbers
- HEATCODE: padded loop bounds to a multiple of 16 iterations



Figure B.22: Pattern of nested loops in f and i in the first example in Figure B.21 before and after optimization. The optimized loop pattern always has a multiple of 16 iterations in the inner vectorized loop, which is beneficial for performance.
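Padding and alignment can be combined in one small helper. The sketch below is an illustration under stated assumptions, not the HEATCODE code: the function names are hypothetical, and it uses `std::aligned_alloc` from C++17 where code built with the Intel compiler of that period would more likely call `_mm_malloc`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Round n up to the next multiple of `lanes` (16 floats = one 512-bit vector).
inline int pad_to_multiple(int n, int lanes) {
    return ((n + lanes - 1) / lanes) * lanes;
}

// Allocate a zero-filled float array whose length is padded to a multiple of
// 16 and whose first element sits on a 64-byte boundary.
// The caller releases the array with std::free().
inline float* alloc_padded_aligned(int n, int* paddedLength) {
    assert(n > 0);
    const int padded = pad_to_multiple(n, 16);
    // 16 floats * 4 bytes = 64 bytes, so the byte count is a multiple of the
    // alignment, as std::aligned_alloc requires.
    void* raw = std::aligned_alloc(64, static_cast<std::size_t>(padded) * sizeof(float));
    if (raw == NULL) return NULL;
    float* p = static_cast<float*>(raw);
    for (int i = 0; i < padded; ++i) p[i] = 0.0f;   // padding elements contribute nothing to sums
    *paddedLength = padded;
    return p;
}
```

Zero-filling the padding means an inner loop can run to the padded bound without changing a sum, which is what allows a `#pragma vector aligned` loop to avoid a scalar remainder.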

#### Optimization for the Intel MIC Architecture: Cache Traffic

- MIC architecture has a similar cache structure to a multi-core CPU
- To minimize cache misses, maximize data locality and re-use
- This is usually done by changing the order of memory accesses:
  - Fusion of loops
  - Nested loop interchange (permutation)
  - Loop tiling (blocking)
  - Cache-oblivious recursion

| Parameter                                                      | Value                                                                      |
|----------------------------------------------------------------|----------------------------------------------------------------------------|
| Cache line size                                                | 64B                                                                        |
| L1 size                                                        | 32KB data, 32KB code                                                       |
| L1 latency                                                     | 1 cycle                                                                    |
| L2 size                                                        | 512KB                                                                      |
| L2 ways                                                        | 8                                                                          |
| L2 latency                                                     | 11 cycles                                                                  |
| Memory $\rightarrow$ L2 prefetching                            | hardware and software                                                      |
| L2 $\rightarrow$ L1 prefetching                                | software only                                                              |
| Translation Lookaside Buffer (TLB) coverage options (L1, data) | 64 pages of size 4KB (256KB coverage); 8 pages of size 2MB (16MB coverage) |

Table credit: Colfax

### Loop Tiling Explained



### **Cache Traffic Optimization in HEATCODE**

```
/* Convolution of temperature distr.
   with emissivity function in the
   HEATCODE library (UNOPTIMIZED, "before") */
for (int i = 0; i < wlBins; ++i) {
    float sum = 0.0f;
    for (int j = 0; j < gIMax; ++j) {
        const float scaling = ...[i,j];
        float result = 0.0f;
        for (int k = 0; k < tempBins; ++k)
            result +=
                planck[i*tempBins + k]*
                distribution[j*tempBins + k];
        sum += result*scaling;
    }
    trans[i] = sum*wavelength[i]*units;
}
```

```
/* OPTIMIZED w/double loop tiling ("after") */
for (int jj = 0; jj < gIMax; jj += jTile) {
  for (int ii = 0; ii < wlBins; ii += iTile) {
    float result[iTile*jTile];
    for (int c = 0; c < iTile*jTile; c++)
      result[c] = 0.0f;
#pragma simd
    for (int k = 0; k < tempBins; ++k)
      for (int c = 0; c < iTile; c++) {
        result[(0)*iTile + c] +=
            distribution[(jj+0)*tempBins+k]*
            planck[(ii+c)*tempBins+k];
        result[(1)*iTile + c] +=
            distribution[(jj+1)*tempBins+k]*
            planck[(ii+c)*tempBins+k];
        result[(2)*iTile + c] +=
            distribution[(jj+2)*tempBins+k]*
            planck[(ii+c)*tempBins+k];
        result[(3)*iTile + c] +=
            distribution[(jj+3)*tempBins+k]*
            planck[(ii+c)*tempBins+k];
        /* ... */
```
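Because the optimized fragment is cut off on the slide, a complete but simplified sketch of the same idea may help. It is not the HEATCODE code: the function names are hypothetical, the `scaling` array layout (`i*nJ + j`) is an assumption, the multiplication by `wavelength[i]*units` is left out, and the inner j loop is written as a loop rather than unrolled by hand. Both functions compute the same sums; the tiled one re-uses each `planck` and `distribution` element several times while it is still in cache.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Reference: trans[i] = sum_j scaling[i*nJ + j] * sum_k planck[i*nK + k] * dist[j*nK + k]
std::vector<float> convolve_naive(const std::vector<float>& planck,
                                  const std::vector<float>& dist,
                                  const std::vector<float>& scaling,
                                  int nI, int nJ, int nK) {
    std::vector<float> trans(nI, 0.0f);
    for (int i = 0; i < nI; ++i)
        for (int j = 0; j < nJ; ++j) {
            float result = 0.0f;
            for (int k = 0; k < nK; ++k)
                result += planck[i * nK + k] * dist[j * nK + k];
            trans[i] += result * scaling[i * nJ + j];
        }
    return trans;
}

// Same computation with tiling in both i and j.
std::vector<float> convolve_tiled(const std::vector<float>& planck,
                                  const std::vector<float>& dist,
                                  const std::vector<float>& scaling,
                                  int nI, int nJ, int nK, int iTile, int jTile) {
    std::vector<float> trans(nI, 0.0f);
    std::vector<float> result(iTile * jTile);
    for (int jj = 0; jj < nJ; jj += jTile) {
        const int jEnd = std::min(jj + jTile, nJ);      // handles a partial last tile
        for (int ii = 0; ii < nI; ii += iTile) {
            const int iEnd = std::min(ii + iTile, nI);
            std::fill(result.begin(), result.end(), 0.0f);
            for (int k = 0; k < nK; ++k)
                for (int j = jj; j < jEnd; ++j)
                    for (int i = ii; i < iEnd; ++i)
                        result[(j - jj) * iTile + (i - ii)] +=
                            dist[j * nK + k] * planck[i * nK + k];
            for (int j = jj; j < jEnd; ++j)
                for (int i = ii; i < iEnd; ++i)
                    trans[i] += result[(j - jj) * iTile + (i - ii)] * scaling[i * nJ + j];
        }
    }
    return trans;
}
```

Good tile sizes depend on the cache sizes listed earlier and are best found by measurement; the values used to exercise this sketch carry no tuning significance.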

### **Memory Traffic Optimization in HEATCODE**





#### Optimization for the Intel MIC Architecture: Offload Data Traffic

- Upon offload, data are transferred across the PCIe bus: ~6 GB/s
- Whenever possible, retain data on coprocessor between offloads
- Memory allocation on coprocessor is slow (a serial operation): ~1 GB/s
- Whenever possible, retain allocated memory on coprocessor between offloads



#### Optimization for the Intel MIC Architecture: Offload Data Traffic

<pre>
/* Offload pragma in HEATCODE, data marshaling directives */
#pragma offload target(mic) \
    in(rfArray : length(n*rfBins)) \
    out(emissivityArray : length(n*rfBins)) \
    in(absorptionCrossSection : length(gIMax*wlBins))
{
    /* ... */
}
</pre>

<pre>
/* Offload pragma in HEATCODE, optimized using data and memory persistence */
#pragma offload target(mic:iDevice) \
    in(rfArray : length(n*rfBins) alloc_if(0) free_if(0)) \
    out(emissivityArray : length(n*rfBins) alloc_if(0) free_if(0)) \
    in(absorptionCrossSection : length(0) alloc_if(0) free_if(0))
{
    /* ... */
}
</pre>

#### **Unoptimized:**

For every offload,

- Sending/receiving input & output
- Sending/receiving model data
- Allocating/deallocating memory

#### <u>Optimized:</u>

For every offload,

- Sending/receiving input & output
- Re-using offloaded model data
- Re-using allocated memory
- Requires initial offload and cleanup (not shown here)
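The initial offload and the cleanup that the slide leaves out might look like the sketch below. It is an assumption based on the Intel compiler's offload extensions (`offload_transfer`, `nocopy`, `alloc_if`, `free_if`) and not the HEATCODE source; all function and array names are hypothetical, and the pragmas are guarded by `__INTEL_OFFLOAD` so that other compilers build the host-only path.

```cpp
#include <cassert>

#ifdef __INTEL_OFFLOAD
#define OFFLOAD_TARGET __attribute__((target(mic)))
#else
#define OFFLOAD_TARGET
#endif

// Hypothetical kernel standing in for the emissivity calculation.
OFFLOAD_TARGET void ApplyModel(int n, float* model, float* input, float* output) {
    for (int i = 0; i < n; ++i) output[i] = model[i] * input[i];
}

// Once, before the first calculation: allocate all buffers on the coprocessor
// and send the model data; keep everything allocated (free_if(0)).
void InitDevice(int n, float* model, float* input, float* output) {
#ifdef __INTEL_OFFLOAD
#pragma offload_transfer target(mic:0) \
    in(model : length(n) alloc_if(1) free_if(0)) \
    nocopy(input, output : length(n) alloc_if(1) free_if(0))
#endif
    (void)n; (void)model; (void)input; (void)output;
}

// For every calculation: move only input and output, re-use the model data
// and the memory that InitDevice left on the coprocessor.
void Calculate(int n, float* model, float* input, float* output) {
#ifdef __INTEL_OFFLOAD
#pragma offload target(mic:0) \
    in(input : length(n) alloc_if(0) free_if(0)) \
    out(output : length(n) alloc_if(0) free_if(0)) \
    in(model : length(0) alloc_if(0) free_if(0))
#endif
    {
        ApplyModel(n, model, input, output);
    }
}

// Once, at the end: release the coprocessor memory without copying anything.
void CleanupDevice(int n, float* model, float* input, float* output) {
#ifdef __INTEL_OFFLOAD
#pragma offload_transfer target(mic:0) \
    nocopy(model, input, output : length(n) alloc_if(0) free_if(1))
#endif
    (void)n; (void)model; (void)input; (void)output;
}
```

The same host pointers must be passed to all three functions, since the offload runtime associates coprocessor buffers with host addresses.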

#### Optimization for the Intel MIC Architecture: Heterogeneous Work Sharing

- In our compute nodes: a 2-socket Xeon CPU + two Xeon Phi
- We would like to use all available compute power
- One Xeon Phi is ~2x faster than CPU  $\rightarrow$  can't split work evenly
- Must split up work into chunks and use "boss-worker" scheduling
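- Easy solution using the OpenMP scheduler: run the chunk loop `for (int i = 0; i < nChunks; i++)` under `#pragma omp parallel for num_threads(3) schedule(dynamic,1)`, set `int iDevice = omp_get_thread_num();` inside the loop body, and offload each chunk with `#pragma offload target(mic: iDevice) if (iDevice > 0)`, so that thread 0 computes on the host and the other two threads each drive one coprocessor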
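The scheduling part of this scheme can be tried without a coprocessor. The sketch below is a hypothetical stand-in in which "processing" a chunk only records which worker took it; in the real code the loop body would contain the offload pragma. With `schedule(dynamic, 1)` each thread fetches the next chunk as soon as it finishes the previous one, so a faster device automatically receives more chunks.

```cpp
#include <cassert>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Returns, for each chunk, the id of the worker thread that processed it.
// Worker 0 plays the role of the host CPU, workers 1..nWorkers-1 the coprocessors.
std::vector<int> process_chunks(int nChunks, int nWorkers) {
    std::vector<int> owner(nChunks, -1);
#pragma omp parallel for num_threads(nWorkers) schedule(dynamic, 1)
    for (int i = 0; i < nChunks; ++i) {
#ifdef _OPENMP
        const int iDevice = omp_get_thread_num();
#else
        const int iDevice = 0;              // serial build: the host does all the work
#endif
        // The actual chunk calculation, offloaded when iDevice > 0, would go here.
        owner[i] = iDevice;
    }
    return owner;
}
```

Smaller chunks balance the load better but increase scheduling and offload overhead, which is the trade-off behind choosing the number of chunks.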



#### Other Optimization Considerations for HEATCODE

- Precomputation often helps; however, sometimes reading a precomputed value is more expensive than computing on the fly
- Avoiding type conversion:
  - consistently use single or double precision variables
  - specify single precision constants as "0.0f", "1.0f", etc.
  - use single-precision math functions: sinf(), expf(), fabsf()...
- Use base 2 logarithms and exponentials: exp2f(), log2f()
- Use -fimf-domain-exclusion=15 if you do not need denormals, NaNs, etc.
- Set MIC\_USE\_2MB\_BUFFERS to improve TLB traffic
- Potentially: use the Intel Math Kernel Library (MKL) for matrix multiplication. MKL has a number of standard routines (xGEMM, FFT, random numbers, etc) optimized for Intel Xeon Phi coprocessors.
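Several of these points (consistent single precision, "f"-suffixed constants and functions, base 2 logarithms and exponentials) come together in the log-space interpolation of Stage 1. The function below is a minimal sketch, not the HEATCODE routine; its name and argument list are chosen for this example. It evaluates Ω(λ) between two tabulated points (λ0, Ω0) and (λ1, Ω1) according to the interpolation formula given earlier in the talk.

```cpp
#include <cassert>
#include <math.h>

// Interpolation that is linear in log(lambda) and log(omega), written entirely
// in single precision with base 2 functions. The result does not depend on the
// base of the logarithm, because the base cancels in the ratio t.
inline float log_interp(float lambda, float lambda0, float lambda1,
                        float omega0, float omega1) {
    const float t = log2f(lambda / lambda0) / log2f(lambda1 / lambda0);
    return omega0 * exp2f(t * log2f(omega1 / omega0));
}
```

For a power law such as Ω = λ², the interpolation reproduces the function exactly between grid points, which makes a convenient check.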

### **Performance Benchmarks: Optimization**



### **Performance Benchmarks: Platforms**

- HEATCODE on one Xeon Phi vs two Xeon E5-2680s: 1.9x speedup
- Why against *two* CPUs?
   Same power ~ 250 W
- Synthetic benchmarks: SGEMM 2.9x, LINPACK 2.6x, STREAM 2.2x
- Before optimization, the (parallel) code was 100x slower on host and 400x slower on MIC



#### **Developer Experience: Programming**

- Initial porting is trivial with native execution: the "-mmic" compiler argument. It works for open source packages and autotools as well. Example:
   `./configure --prefix=~/mic/xerces-c --without-curl --enable-transcoder-iconv CC="icc" CXX="icpc" CFLAGS="-mmic" CXXFLAGS="-mmic" --host=x86_64`
- Explicit offload model is straightforward, but must pack all data into arrays
- Initially, the ported code performed miserably on the coprocessor. However, this meant that it was not doing very well on the host, either.
- Optimizations for coprocessor lead to better performance on the host, and vice-versa. Coprocessor/host performance ratio is a measure of efficiency.
- Areas of optimization: thread scalability, vectorization, scalar efficiency, cache traffic, communication with coprocessor.
- If you don't know where to optimize, use VTune.

#### **Developer Experience: Development Tools**

- Must use Intel compilers (\$500-1500 single user academic license)
- Intel C++ compiler beats GCC by 3x (HEATCODE on the host)
- VTune: performance analysis with hardware event sampling. Works on Intel CPUs and Xeon Phi. The greatest thing since sliced bread. Can find hotspots down to a single line of code.
- Debugger is available, but "printf debugging" works, too: console output from the coprocessor is piped to host
- Intel MKL has a lot of optimized routines for Xeon Phi. Binaries are redistributable

#### Intel VTune Parallel Amplifier XE

#### General Exploration - Knights Corner Platform

Identify where microarchitectural issues affect the performance of your application. Press F1 for more details.

- 🗹 Analyze general cache usage
- Analyze vectorization usage
- Analyze TLB misses
- Analyze additional L2 cache events

| Function / Call Stack                         | CPU Time |
|-----------------------------------------------|----------|
| `thXeonPhi::RadiationFieldToTemperatureDistr` | 659.011s |
| `thXeonPhi::CalculateEmissivity`              | 202.414s |
| `intel_lrb_memset`                            | 124.030s |
| `kmp_wait_sleep`                              | 79.249s  |
| `_kmp_static_yield`                           | 46.179s  |
| `kmp_yield`                                   | 5.722s   |

| Line | Source                                     | CPU Time |
|------|--------------------------------------------|----------|
| 239  | `#ifdef HAVE_ICC`                          |          |
| 240  | `#pragma simd reduction(+: sum)`           |          |
| 241  | `#pragma vector aligned`                   |          |
| 242  | `#endif`                                   |          |
| 243  | `for (int i = 0; i < tempBins; ++i)`       | 8.850s   |
| 244  | `sum += bMatrix[f*tempBins + i]*x[i];`     | 70.599s  |
| 245  |                                            |          |
| 246  | `// rTransientMatrixOverDiagonal contains` |          |
| 247  | `// (or zeroes if enthalpyDelta == 0, whi` |          |
| 248  | `x[f] = sum*rTransientMatrixOverDiagonal[` | 4.325s   |

#### **Developer Experience: Optimization**

- Is it easy to port applications to the MIC architecture? Yes.
- Do I get accelerated performance out of the box? Likely not.
- Same code on the host and the MIC? True.
- Same optimization for the host and the MIC? In many cases, true.
- For HEATCODE, optimization involved:
  - Loop fusion to reduce memory footprint, improve data locality
  - Floating point precision consistency (constants, variables, functions)
  - Strength reduction & precomputation in common expressions
  - Nested loop permutation and tiling to improve cache traffic
  - SIMD loop bounds and data alignment: pad to a multiple of 64 bytes
  - Vectorization tuning w/pragmas (loop count, aligned notice, ...)
  - Data & memory persistence on the coprocessor (PCIe traffic)

### Learning Resources on the Intel MIC Architecture Programming

- "Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor High Performance Programming", Jim Jeffers and James Reinders. On Amazon.com: Jeffers & Reinders
- "Parallel Programming and Optimization with Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessors: Handbook on the Development and Optimization of Parallel Applications for Intel Xeon Processors and Intel Xeon Phi Coprocessors", Colfax International, foreword by James Reinders, Intel Corporation. http://colfax-intl.com/xeonphi/

### **Learning Resources** on the Intel MIC Architecture Programming

#### http://software.intel.com/mic-developer



[Screenshot: Intel® C++ Compiler XE 13.1 User and Reference Guides, topic on programming for the Intel® MIC Architecture. The visible entries are the `offload`, `offload_attribute`, `offload_transfer` and `offload_wait` pragmas and the `_Cilk_offload` keyword.]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               | Keywords to control the data transfer between the CPU a piler XE 13.1 |  |


### Summary

- HEATCODE: a new library for fast calculation of stochastic cosmic dust grain heating and emissivity, with support for the Intel MIC architecture and heterogeneous multi-core/many-core systems
- Optimizing for the MIC architecture also yields significant performance gains on the host multi-core CPUs, and vice versa
- One code for CPUs and Intel Xeon Phi coprocessors
- Publication for Computer Physics Communications in preparation
- Source code will be publicly available via the CPC Program Library

#### ACKNOWLEDGEMENT

We thank Colfax International and Intel for early access to Intel Xeon Phi coprocessors and optimization guides.



### **Backup Slides**

#### Optimization for the Intel MIC Architecture: Heterogeneous Work Sharing
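
The figure for this slide is not reproduced here. As a rough illustration of the idea only, the sketch below shows one simple form of heterogeneous work sharing: static partitioning of independent work items among devices in proportion to an assumed relative speed. The function name `split_work` and the speed values are hypothetical and are not taken from HEATCODE; the scheme used in the talk may differ (for example, it may balance load dynamically).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of HEATCODE): split n_items independent
// work items among devices in proportion to their assumed relative speeds.
// Returns the number of items assigned to each device; counts sum to n_items.
std::vector<std::size_t> split_work(std::size_t n_items,
                                    const std::vector<double>& speeds) {
    assert(!speeds.empty());
    double total = 0.0;
    for (double s : speeds) {
        assert(s > 0.0);  // every device must have a positive speed
        total += s;
    }

    std::vector<std::size_t> counts(speeds.size());
    std::size_t assigned = 0;  // items handed out so far
    double cumulative = 0.0;   // running sum of speeds
    for (std::size_t i = 0; i < speeds.size(); ++i) {
        cumulative += speeds[i];
        // End index of this device's chunk. The last device takes the
        // remainder so that no item is lost to rounding.
        std::size_t end = (i + 1 == speeds.size())
            ? n_items
            : static_cast<std::size_t>(n_items * (cumulative / total));
        counts[i] = end - assigned;
        assigned = end;
    }
    return counts;
}
```

For example, `split_work(1000, {1.0, 1.0, 2.0})` assigns 250, 250 and 500 items to three devices, the third of which is assumed to be twice as fast as each of the other two. In practice the speeds would have to be measured, since they depend on the workload.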



#### **Intel MIC Architecture**



Diagram credit: Intel