Accelerating Public Domain Applications: Lessons from Models of Radiation Transport in the Milky Way Galaxy

November 25, 2013

Last week I had the privilege of giving a talk at the Intel Theater at SC’13. I presented a case study done with Stanford University on using Intel Xeon Phi coprocessors for accelerating a new astrophysical library HEATCODE (HEterogeneous Architecture library for sTochastic COsmic Dust Emissivity). If this talk can be summarized in one sentence, that will be “One high performance code for two platforms is reality“. Indeed, the optimizations performed in order to optimize HEATCODE for the MIC architecture lead to a tremendous performance increase on the CPU platform. As a consequence, we have developed a high performance library which can be employed and modified both by users who have access to Xeon Phi coprocessors, and by those only using multi-core CPUs. The paper introducing HEATCODE library with details of the optimization process is under review at Computer Physics Communications. The preliminary manuscript can be obtained from arXiv, and the slides of the talk are available on this page (see links above and below). The open source code will be made available [...]

Heterogeneous Clustering with Homogeneous Code: Accelerate MPI Applications Without Code Surgery Using Intel Xeon Phi Coprocessors

October 17, 2013

This paper reports on our experience with a heterogeneous cluster execution environment, in which a distributed parallel application utilizes two types of compute devices: those employing general-purpose processors, and those based on computing accelerators known as Intel Xeon Phi coprocessors. Unlike general-purpose graphics processing units (GPGPUs), Intel Xeon Phi coprocessors are able to execute native applications. In this mode, the application runs in the coprocessor’s operating system, and does not require a host process executing on the CPU and offloading data to the accelerator (coprocessor). Therefore, for an application in the MPI framework, it is possible to run MPI processes directly on coprocessors. In this case, coprocessors behave like independent compute nodes in the cluster, with an MPI rank, peer-to-peer communication capability, and access to a network-shared file system. With such configuration, there is no need to instrument data offload in the application in order to utilize a heterogeneous system comprised of processors and coprocessors. That said, an [...]

Multithreaded Transposition of Square Matrices with Common Code for Intel Xeon Processors and Intel Xeon Phi Coprocessors

August 12, 2013

In-place matrix transposition, a standard operation in linear algebra, is a memory bandwidth-bound operation. The theoretical maximum performance of transposition is the memory copy bandwidth. However, due to non-contiguous memory access in the transposition operation, practical performance is usually lower. The ratio of the transposition rate to the memory copy bandwidth is a measure of the transposition algorithm efficiency. This paper demonstrates and discusses an efficient C language implementation of parallel in-place square matrix transposition. For large matrices, it achieves a transposition rate of 49 GB/s (82% efficiency) on Intel Xeon CPUs and 113 GB/s (67% efficiency) on Intel Xeon Phi coprocessors. The code is tuned with pragma-based compiler hints and compiler arguments. Thread parallelism in the code is handled by OpenMP, and vectorization is automatically implemented by the Intel compiler. This approach allows to use the same C code for a CPU and for a MIC architecture executable, both demonstrating high efficiency. For benchmarks, an Intel Xeon Phi 7110P [...]

Accelerated Simulations of Cosmic Dust Heating Using the Intel Many Integrated Core Architecture

June 7, 2013

Cosmic dust absorbs starlight in the optical and ultraviolet ranges, and re-emits it in the infrared range. This process is crucial for radiative transport in our Galaxy. I am participating in a project to develop a computational tool for Galactic radiative transport simulation with stochastic light absorption and re-emission on small dust grains. This project has resulted in the development of a library called HEATCODE (HEterogeneous Architecture library for sTochastic COsmic Dust Emissivity) for fast calculation of the stochastic dust heating process using Intel Xeon Phi coprocessors. I presented HEATCODE and shared my experiences with the development and optimization of applications for Xeon Phi coprocessors in a talk at the Applied Mathematics and Statistics Department at UCSC. The slides from this talk can be downloaded here (see below). The full source code of the application, along with a detailed description of the optimization process, will soon be submitted for peer-reviewed publication, and will become publicly available. Slides from the talk: [...]

How to Write Your Own Blazingly Fast Library of Special Functions for Intel Xeon Phi Coprocessors

May 3, 2013

Statically-linked libraries are used in business and academia for security, encapsulation, and convenience reasons. Static libraries with functions offloadable to Intel Xeon Phi coprocessors must contain executable code for both the host and the coprocessor architecture. Furthermore, for library functions used in data-parallel contexts, vectorized versions of the functions must be produced at the compilation stage. This white paper shows how to design and build statically-linked libraries with functions offloadable to Intel Xeon Phi coprocessors. In addition, it illustrates how special functions with scalar syntax (e.g., y=f(x)) can be implemented in such a way that user applications can use them in thread- and data-parallel contexts. The second part of the paper demonstrates some optimization methods that improve the performance of functions with scalar syntax on the multi-core and the many-core platforms: precision control, strength reduction, and algorithmic optimizations. Complete paper:  Colfax_Static_Libraries_Xeon_Phi.pdf (426 KB) — this file is available only to [...]

Cache Traffic Optimization on Intel Xeon Phi Coprocessors for Parallel In-Place Square Matrix Transposition with Intel Cilk Plus and OpenMP

April 25, 2013

Numerical algorithms sensitive to the performance of processor caches can be optimized by increasing the locality of data access. Loop tiling and recursive divide-and-conquer are common methods for cache traffic optimization. This paper studies the applicability of these optimization methods in the Intel Xeon Phi architecture for the in-place square matrix transposition operation. Optimized implementations in the Intel Cilk Plus and OpenMP frameworks are presented and benchmarked. Cache-oblivious nature of the recursive algorithm is compared to the tunable character of the tiled method. Results show that Intel Xeon Phi coprocessors transpose large matrices faster than the host system, however, smaller matrices are more efficiently transposed by the host. On the coprocessor, the Intel Cilk Plus framework excels for large matrix sizes, but incurs a significant parallelization overhead for smaller sizes. Transposition of smaller matrices on the coprocessor is faster with OpenMP. COMMENTS: If you are interested in this paper, make sure to also read a follow-up publication (improved [...]

Test-driving Intel® Xeon Phi™ coprocessors with a basic N-body simulation

January 7, 2013

Intel® Xeon Phi™ coprocessors are capable of delivering more performance and better energy efficiency than Intel® Xeon® processors for certain parallel applications. In this paper, we investigate the porting and optimization of a test problem for the Intel Xeon Phi coprocessor. The test problem is a basic N-body simulation, which is the foundation of a number of applications in computational astrophysics and biophysics. Using common code in the C language for the host processor and for the coprocessor, we benchmark the N-body simulation. The simulation runs 2.3x to 5.4x times faster on a single Intel Xeon Phi coprocessor than on two Intel Xeon E5 series processors. The performance depends on the accuracy settings for transcendental arithmetics. We also study the assembly code produced by the compiler from the C code. This allows us to pinpoint some strategies for designing C/C++ programs that result in efficient automatically vectorized applications for Intel Xeon family devices. The visualization shown below demonstrates the results and the performance of the [...]

Squeezing More Instructions per Cycle out of the Intel Sandy Bridge CPU Pipeline

July 31, 2012

Parallelism in modern CPU architectures is supported at hardware level by multiple cores, vector registers, and pipelines. While the utilization of the former two is a shared responsibility of the programmer and the compiler, pipelining is handled completely by the processor. It is, however, useful for the developer to know what types of workloads optimize pipeline utilization. This paper shows one example where a specific workload improves the number of instructions executed per clock cycle, boosting arithmetic performance. This workload is comprised of two independent data processing tasks, one performing the AVX addition instruction and the other — the AVX multiplication instruction. Even though these tasks are executed sequentially on one core, alternating additions and multiplications in the code allows the CPU to complete the task 40% faster than when a sequence of additions is followed by a sequence of multiplications. Such workloads are common in linear algebraic applications. Examples in the paper illustrate how improved performance can be achieved in portable C code [...]

Scientific Computing in a Web Browser: GALPROP WebRun

June 30, 2012

As scientific software tools become increasingly complex and computationally demanding, sharing the source code of a scientific project with the community may be insufficient to support peer interest and ensure the appropriate use of the tools. In order to facilitate the use of astrophysical code GALPROP, our group has launched a public online service named GALPROP WebRun. This service, live since August 2010, includes: the ability to configure GALPROP computing tasks in a Web browser; access to a dedicated computing cluster and precompiled binaries for code execution; and user support in the form of online documentation, automated validation tools, and forum/bug reporting online software. This paper reports the details and status of the GALPROP WebRun project as well as our experience with it. Complete paper:  Colfax_Galprop_WebRun.pdf (2 MB) — this file is available only to registered users. Register or Log [...]

Arithmetics on Intel’s Sandy Bridge and Westmere CPUs: not all FLOPs are created equal

April 30, 2012

This paper presents a new arithmetic efficiency benchmark and uses it to compare the Intel Sandy Bridge E5-2680 CPU to the Intel Westmere X5690 CPU performance. The efficiency is measured for single and double precision floating point operations: addition, multiplication, division, square root and the exponential function, and for 32- and 64-bit integer operations: addition, multiplication and division. The SSE2 and AVX instruction sets, as well as scalar operations, in single-threaded and multi-threaded modes are covered. This benchmark eliminates the effects of memory bandwidth and latency by fitting the calculation in the L1 cache. The bandwidth of the L1 cache and main memory (RAM) are estimated for reference, and the LINPACK benchmark result is reported. Results show that the E5-2680 CPU performs floating point addition and multiplication dramatically faster (up to 2.6x) than the X5690 model. However, the floating point division and square root are the new model’s weak spots. AVX floating point operations addition and multiplication are up to 2.0x faster than the SSE2; [...]
1 2 3 4 5