Publications

Optimization Techniques for the Intel MIC Architecture. Part 3 of 3: False Sharing and Padding

August 8, 2015

This is part 3 of a 3-part educational series of publications introducing select topics on optimization of applications for Intel’s multi-core and manycore architectures (Intel Xeon processors and Intel Xeon Phi coprocessors). In this paper we discuss false sharing, highlighting the situations in which it may occur, and eliminating it with the help of data container padding. For a practical illustration, we construct and optimize a micro-kernel for binning particles based on their coordinates. Similar workloads occur in Monte Carlo simulations, particle physics software, and statistical analysis. Results show that the impact of false sharing may be as high as an order of magnitude performance loss in a parallel application. On Intel Xeon processors, padding required to eliminate false sharing is greater than on Intel Xeon Phi coprocessors, so target-specific padding values may be used in real-life applications. See also: Part 1: Multi-Threading and Parallel Reduction Part 2: Strip-Mining for Vectorization Part 3: False Sharing and Padding Complete paper: [...]

Software Developer’s Introduction to the HGST Ultrastar Archive Ha10 SMR Drives

July 31, 2015

In this paper we will discuss the new HGST Shingled Magnetic Recording (SMR) drives, Ultrastar Archive Ha10, which offers storage capacities of 10 TB and beyond. With their high-density storage capacities, these drives are well suited for large “active archive” applications. In an active archive application, the data is frequently read but seldom modified. The SMR drives are host managed, meaning that the developer must manage the data storage on the drives. In this publication we introduce an open source library, libzbc, which was developed by the HGST team to assist developers who use SMR drives. The discussions cover topics from the very basics like opening a device, to more advanced topics like data padding. The goal of this paper is to give readers the necessary knowledge and tools to develop applications with libzbc. We will present an example, and then report several benchmarks of I/O operations on the HGST SMR drives, and discuss the SMR drive’s effectiveness as an active archive solution. Complete paper:  HGST_Introduction_to_libzbc.pdf (361 KB) — this [...]

Optimization Techniques for the Intel MIC Architecture. Part 2 of 3: Strip-Mining for Vectorization

June 26, 2015

This is part 2 of a 3-part educational series of publications introducing select topics on optimization of applications for Intel’s multi-core and manycore architectures (Intel Xeon processors and Intel Xeon Phi coprocessors). In this paper we discuss data parallelism. Our focus is automatic vectorization and exposing vectorization opportunities to the compiler. For a practical illustration, we construct and optimize a micro-kernel for particle binning particles. Similar workloads occur applications in Monte Carlo simulations, particle physics software, and statistical analysis. The optimization technique discussed in this paper leads to code vectorization, which results in an order of magnitude performance improvement on an Intel Xeon processor. Performance on Xeon Phi compared to that on a high-end Xeon is 1.4x greater in single precision and 1.6x greater in double precision. See also: Part 1: Multi-Threading and Parallel Reduction Part 2: Strip-Mining for Vectorization Part 3: False Sharing and Padding Complete paper:  Colfax_Optimization_Techniques_2_of_3.pdf (650 KB) [...]

Optimization Techniques for the Intel MIC Architecture. Part 1 of 3: Multi-Threading and Parallel Reduction

May 29, 2015

This is part 1 of a 3-part educational series of publications introducing select topics on optimization of applications for the Intel multi-core and manycore architectures (Intel Xeon processors and Intel Xeon Phi coprocessors). In this paper we focus on thread parallelism and race conditions. We discuss the usage of mutexes in OpenMP to resolve race conditions. We also show how to implement efficient parallel reduction using thread-private storage and mutexes. For a practical illustration, we construct and optimize a micro-kernel for binning particles based on their coordinates. Workloads like this one occur in such applications as Monte Carlo simulations, particle physics software, and statistical analysis. The optimization technique discussed in this paper leads to a performance increase of 25x on a 24-core CPU and up to 100x on the MIC architecture compared to a single-threaded implementation on the same architectures. In the next publication of this series, we will demonstrate further optimization of this workload, focusing on vectorization. See also: Part 1: Multi-Threading [...]

Performance to Power and Performance to Cost Ratios with Intel Xeon Phi Coprocessors (and why 1x Acceleration May Be Enough)

January 27, 2015

The paper studies two performance metrics of systems enabled with Intel Xeon Phi coprocessors: the ratio of performance to consumed electrical power and the ratio of performance to purchasing system cost, both under the assumption of linear parallel scalability of the application. Performance to power values are measured for three workloads: a compute-bound workload (DGEMM), a memory bandwidth-bound workload (STREAM), and a latency-limited workload (small matrix LU decomposition). Performance to cost ratios are computed, using system configurations and prices available at Colfax International, as functions of the acceleration factor and of the number of coprocessors per system. That study considers hypothetical applications with acceleration factor from 0.35x to 2x. In all studies, systems with Intel Xeon Phi coprocessors yield better metrics than systems with only Intel Xeon processors. That applies even with acceleration factor of 1x, as long as the application can be distributed between the CPU and the coprocessor. Complete paper:  Colfax_1x.pdf (321 KB) — this file is [...]

Fine-Tuning Vectorization and Memory Traffic on Intel Xeon Phi Coprocessors: LU Decomposition of Small Matrices

January 27, 2015

Common techniques for fine-tuning the performance of automatically vectorized loops in applications for Intel Xeon Phi coprocessors are discussed. These techniques include strength reduction, regularizing the vectorization pattern, data alignment and aligned data hint, and pointer disambiguation. In addition, the loop tiling technique of memory traffic tuning is shown. The optimization methods are illustrated on an example of single-threaded LU decomposition of a single precision matrix of size 128×128. Benchmarks show that the discussed optimizations improve the performance on the coprocessor by a factor of 2.8 compared to the unoptimized code, and by a factor of 1.7 on the multi-core host system, achieving roughly the same performance on the host and on the coprocessor. The code discussed in the paper can be freely downloaded from this page. Complete paper:  Colfax_LU.pdf (604 KB) — this file is available only to registered users. Register or Log In. Source code for Linux: colfax-lu.tgz (17 KB) — this archive is available only to registered users. Register or [...]

Intel Cilk Plus for Complex Parallel Algorithms: “Enormous Fast Fourier Transforms” (EFFT) Library

September 18, 2014

In this paper we demonstrate the methodology for parallelizing the computation of large one-dimensional discrete fast Fourier transforms (DFFTs) on multi-core Intel Xeon processors. DFFTs based on the recursive Cooley-Tukey method have to control cache utilization, memory bandwidth and vector hardware usage, and at the same time scale across multiple threads or compute nodes. Our method builds on single-threaded Intel Math Kernel Library (MKL) implementation of DFFT, and uses the Intel Cilk Plus framework for thread parallelism. We demonstrate the ability of Intel Cilk Plus to handle parallel recursion with nested loop-centric parallelism without tuning the code to the number of cores or cache metrics. The result of our work is a library called EFFT that performs 1D DFTs of size 2^N for N>=21 faster than the corresponding Intel MKL parallel DFT implementation by up to 1.5x, and faster than FFTW by up to 2.5x. The code of EFFT is available for free download under the GPLv3 license. This work provides a new efficient DFFT implementation, and at the same time demonstrates an [...]

Installing Intel MPSS 3.3 in Arch Linux

August 20, 2014

This technical publication provides instructions for installing the Intel Manycore Platform Software Stack (MPSS) version 3.3 in Arch Linux operating system. Intel MPSS is a suite of tools necessary for operation of Intel Xeon Phi coprocessors. Instructions provided here enable offload and networking functionality for coprocessors in Arch Linux. The procedure described in this paper is completely reversible via an uninstallation script. Downloads: Product Direct Link Intel MPSS 3.3 (page, archive) mpss-3.3-linux.tar (~400 MB) Linux Kernel 3.10 LTS (AUR) linux-lts310.tar.gz (78 KB) TRee Installation Generator (TRIG) trig.sh (3 KB) — this archive is available only to registered users. Register or Log In. RHEL networking utilities rhnet.tgz (34 KB) — this archive is available only to registered users. Register or Log In. Offload functionality test Offload-Hello.cc (347 B) — this archive is available only to registered users. Register or Log In. GNU Public License v2 (applies to TRIG and RHEL utilities) page Paper:  Colfax_MPSS_in_Arch_Linux.pdf (97 KB) — [...]

File I/O on Intel Xeon Phi Coprocessors: RAM disks, VirtIO, NFS and Lustre

July 28, 2014

The key innovation brought about by Intel Xeon Phi coprocessors is the possibility to port most HPC applications to manycore computing accelerators without code modification. One of the reasons why this is possible is support for file input/output (I/O) directly from applications running on coprocessors. These facilities allow seamless usage of manycore accelerators in common HPC tasks such as application initialization from file data, saving running output, checkpointing and restarting, data post-processing and visualization, and other. This paper provides information and benchmarks necessary to make the choice of the best file system for a given application from a number of the available options: RAM disks, virtualized local hard drives, and distributed storage shared with NFS or Lustre. We report benchmarks of I/O performance and parallel scalability on Intel Xeon Phi coprocessors, strengths and limitations of each option. In addition, the paper presents system administration procedures necessary for using each file system on coprocessors, including bridged networking and [...]

Colfax Research papers translated to Japanese

July 14, 2014

With the help of our partners at Intel, some of our articles on Intel Xeon Phi coprocessor programming were translated to the Japanese language. インテル社の協力で、弊社のインテル(R) Xeon Phi(TM) コプロセッサーのプログラミングについての白書の一部が日本語に翻訳されました。 Original: Configuration and Benchmarks of Peer-to-Peer Communication over Gigabit Ethernet and InfiniBand in a Cluster with Intel Xeon Phi Coprocessors Translation:  JP-Colfax_InfiniBand_for_MIC.pdf (2 MB) — this file is available only to registered users. Register or Log In. Original: Heterogeneous Clustering with Homogeneous Code: Accelerate MPI Applications Without Code Surgery Using Intel Xeon Phi Coprocessors Translation:  JP-Colfax_Heterogeneous_Clustering_Xeon_Phi.pdf (657 KB) — this file is available only to registered users. Register or Log In. Original: Multithreaded Transposition of Square Matrices with Common Code for Intel Xeon Processors and Intel Xeon Phi Coprocessors Translation:  JP-Colfax_Transposition-7110P.pdf (987 [...]
1 2 3 4