Publications

Introduction to Intel DAAL, Part 2: Distributed Variance-Covariance Matrix Computation

March 28, 2016

This is part 2 of 3 in an introductory series of publications on the Intel® Data Analytics Acceleration Library (DAAL). DAAL is a data analytics library optimized for modern highly parallel computer architectures such as Intel Xeon and Intel Xeon Phi processors. The goal of this series is to provide developers with a technical overview of developing applications with DAAL. In part 1 of the series we discussed how to implement batch mode computation on a single node. In the present publication we discuss distributed mode computation, focusing on both how and when to implement it with Intel DAAL. As an example workload, we implement an application that uses DAAL to compute a covariance matrix of a set of vectors. We first demonstrate how to use distributed mode with this example. Then, using this example application, we scan the parameter space to determine what parameter ranges benefit from distributed computation. We also demonstrate how the output of this computation may be used in image processing to compute the eigenvectors of [...]
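
A minimal sketch of the two-step structure behind a distributed covariance computation: each node reduces its data block to partial sums, and a master node merges them. This is plain C++ for illustration only; it does not use the Intel DAAL API or the paper's source code, and the function names are made up for this example.

```cpp
// Sketch of distributed covariance: local (per-node) reduction to partial sums,
// then a master-side merge and finalization. Illustration only, not DAAL code.
#include <vector>
#include <cstddef>

struct PartialResult {
    std::size_t n = 0;                 // number of observations in the block
    std::vector<double> sum;           // per-feature sums, length p
    std::vector<double> crossProduct;  // sums of x_i * x_j, length p*p
};

// Local step: reduce a block of row-major data (rows x p) to partial sums.
PartialResult localStep(const double* data, std::size_t rows, std::size_t p) {
    PartialResult r;
    r.n = rows;
    r.sum.assign(p, 0.0);
    r.crossProduct.assign(p * p, 0.0);
    for (std::size_t k = 0; k < rows; ++k) {
        const double* x = data + k * p;
        for (std::size_t i = 0; i < p; ++i) {
            r.sum[i] += x[i];
            for (std::size_t j = 0; j < p; ++j)
                r.crossProduct[i * p + j] += x[i] * x[j];
        }
    }
    return r;
}

// Master step: merge partial results (e.g., gathered over MPI) and finalize
// cov[i][j] = (C_ij - S_i * S_j / n) / (n - 1).
std::vector<double> masterStep(const std::vector<PartialResult>& parts, std::size_t p) {
    PartialResult total;
    total.sum.assign(p, 0.0);
    total.crossProduct.assign(p * p, 0.0);
    for (const auto& r : parts) {
        total.n += r.n;
        for (std::size_t i = 0; i < p; ++i) total.sum[i] += r.sum[i];
        for (std::size_t i = 0; i < p * p; ++i) total.crossProduct[i] += r.crossProduct[i];
    }
    std::vector<double> cov(p * p);
    for (std::size_t i = 0; i < p; ++i)
        for (std::size_t j = 0; j < p; ++j)
            cov[i * p + j] = (total.crossProduct[i * p + j]
                              - total.sum[i] * total.sum[j] / total.n)
                             / (total.n - 1);
    return cov;
}
```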

Introduction to Intel DAAL, Part 1: Polynomial Regression with Batch Mode Computation

October 28, 2015

This is part 1 of 3 in an introductory series of publications on the Intel Data Analytics Acceleration Library (DAAL). DAAL is a data analytics library optimized for modern highly parallel computer architectures such as Intel Xeon and Intel Xeon Phi processors. The goal of this series is to provide developers with a technical overview of developing applications with DAAL. In this paper we focus on two aspects of developing an application with Intel DAAL: data management and computation. As a practical example, we implement a simple machine learning application with polynomial regression using the library in the batch computation mode. We demonstrate using this application for data-based prediction of hydrodynamic properties of yachts. The source code and data for the sample application are available for free download. The second and third parts of the series will discuss other aspects of data analysis with DAAL. In part 2, we discuss distributed data and computation in conjunction with MPI. In the third part, we discuss the case with multiple data sets and interfacing with a [...]
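
For readers unfamiliar with the underlying math, a minimal sketch of polynomial least-squares fitting is shown below. It solves the normal equations directly in plain C++; it is not the DAAL-based implementation from the paper, and the function name is hypothetical.

```cpp
// Polynomial least squares: fit y ~ beta_0 + beta_1*x + ... + beta_d*x^d by
// building and solving the normal equations (V^T V) beta = V^T y, where V is
// the Vandermonde matrix. Illustration only, not the paper's DAAL code.
#include <vector>
#include <cmath>
#include <cstddef>
#include <algorithm>

std::vector<double> fitPolynomial(const std::vector<double>& x,
                                  const std::vector<double>& y,
                                  std::size_t degree) {
    const std::size_t n = x.size(), m = degree + 1;
    std::vector<double> A(m * m, 0.0), b(m, 0.0);
    for (std::size_t k = 0; k < n; ++k) {
        std::vector<double> xpow(m);
        xpow[0] = 1.0;
        for (std::size_t i = 1; i < m; ++i) xpow[i] = xpow[i - 1] * x[k];
        for (std::size_t i = 0; i < m; ++i) {
            b[i] += xpow[i] * y[k];
            for (std::size_t j = 0; j < m; ++j) A[i * m + j] += xpow[i] * xpow[j];
        }
    }
    // Solve A beta = b by Gauss-Jordan elimination with partial pivoting.
    std::vector<double> beta(b);
    for (std::size_t c = 0; c < m; ++c) {
        std::size_t piv = c;
        for (std::size_t r = c + 1; r < m; ++r)
            if (std::fabs(A[r * m + c]) > std::fabs(A[piv * m + c])) piv = r;
        for (std::size_t j = 0; j < m; ++j) std::swap(A[c * m + j], A[piv * m + j]);
        std::swap(beta[c], beta[piv]);
        for (std::size_t r = 0; r < m; ++r) {
            if (r == c) continue;
            const double f = A[r * m + c] / A[c * m + c];
            for (std::size_t j = c; j < m; ++j) A[r * m + j] -= f * A[c * m + j];
            beta[r] -= f * beta[c];
        }
    }
    for (std::size_t c = 0; c < m; ++c) beta[c] /= A[c * m + c];
    return beta;  // beta[i] is the coefficient of x^i
}
```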

Optimization Techniques for the Intel MIC Architecture. Part 3 of 3: False Sharing and Padding

August 8, 2015

This is part 3 of a 3-part educational series of publications introducing select topics on optimization of applications for Intel’s multi-core and manycore architectures (Intel Xeon processors and Intel Xeon Phi coprocessors). In this paper we discuss false sharing, highlighting the situations in which it may occur, and eliminating it with the help of data container padding. For a practical illustration, we construct and optimize a micro-kernel for binning particles based on their coordinates. Similar workloads occur in Monte Carlo simulations, particle physics software, and statistical analysis. Results show that the impact of false sharing may be as high as an order of magnitude performance loss in a parallel application. On Intel Xeon processors, padding required to eliminate false sharing is greater than on Intel Xeon Phi coprocessors, so target-specific padding values may be used in real-life applications. See also: Part 1: Multi-Threading and Parallel Reduction Part 2: Strip-Mining for Vectorization Part 3: False Sharing and Padding Complete paper: [...]
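
The padded-container idea can be sketched as follows: each thread bins into its own cache-line-sized block of counters, so concurrent increments never land on the same cache line. The 64-byte padding and the bin count below are illustrative choices, not the tuned values from the paper.

```cpp
// Padding to avoid false sharing in a particle-binning micro-kernel.
// Build as C++17 so std::vector honours the 64-byte over-alignment.
#include <omp.h>
#include <vector>
#include <cstddef>

constexpr std::size_t nBins = 16;

// Each thread's counters occupy their own cache line(s); the pad keeps
// neighbouring threads' counters from sharing a line.
struct alignas(64) PaddedBins {
    int count[nBins] = {0};
    char pad[64];
};

std::vector<int> binParticles(const float* x, std::size_t n, float xMin, float binWidth) {
    const int nThreads = omp_get_max_threads();
    std::vector<PaddedBins> perThread(nThreads);

    #pragma omp parallel
    {
        PaddedBins& mine = perThread[omp_get_thread_num()];
        #pragma omp for
        for (std::ptrdiff_t i = 0; i < (std::ptrdiff_t)n; ++i) {
            std::size_t b = (std::size_t)((x[i] - xMin) / binWidth);
            if (b < nBins) ++mine.count[b];   // only this thread writes this line
        }
    }

    // Serial reduction of the per-thread counters.
    std::vector<int> bins(nBins, 0);
    for (const auto& t : perThread)
        for (std::size_t b = 0; b < nBins; ++b) bins[b] += t.count[b];
    return bins;
}
```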

Software Developer’s Introduction to the HGST Ultrastar Archive Ha10 SMR Drives

July 31, 2015

In this paper we discuss the new HGST Shingled Magnetic Recording (SMR) drives, Ultrastar Archive Ha10, which offer storage capacities of 10 TB and beyond. With their high-density storage capacities, these drives are well suited for large “active archive” applications. In an active archive application, the data is frequently read but seldom modified. The SMR drives are host-managed, meaning that the developer must manage the data storage on the drives. In this publication we introduce an open source library, libzbc, which was developed by the HGST team to assist developers who use SMR drives. The discussion covers topics from the very basics, such as opening a device, to more advanced topics, such as data padding. The goal of this paper is to give readers the necessary knowledge and tools to develop applications with libzbc. We present an example, report several benchmarks of I/O operations on the HGST SMR drives, and discuss their effectiveness as an active archive solution. Complete paper:  HGST_Introduction_to_libzbc.pdf (361 KB) — this [...]
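
The host-managed constraint can be illustrated with a small sketch: within a sequential-write-required zone, data may only be appended at the zone's write pointer. The code below models that rule with plain POSIX I/O and a host-side write-pointer table; it does not use the libzbc API described in the paper, and the zone size is a made-up example value.

```cpp
// Toy model of host-managed SMR zones: append-only writes at the write pointer,
// unrestricted reads of already-written data. Not libzbc; illustration only.
#include <unistd.h>
#include <cstdint>
#include <cstddef>

constexpr std::uint64_t zoneSize = 256ULL << 20;   // hypothetical 256 MiB zones

struct ZoneState {
    std::uint64_t start;         // first byte of the zone on the device
    std::uint64_t writePointer;  // next byte that may legally be written
};

// Append-only write into a zone: succeeds only at the current write pointer.
ssize_t zoneAppend(int fd, ZoneState& z, const void* buf, std::size_t len) {
    if (z.writePointer + len > z.start + zoneSize)
        return -1;                                  // would overflow the zone
    ssize_t w = pwrite(fd, buf, len, (off_t)z.writePointer);
    if (w > 0) z.writePointer += (std::uint64_t)w;  // advance like the drive does
    return w;
}

// Reads, by contrast, may target any already-written offset within the zone.
ssize_t zoneRead(int fd, const ZoneState& z, void* buf, std::size_t len, std::uint64_t off) {
    if (z.start + off + len > z.writePointer) return -1;   // past valid data
    return pread(fd, buf, len, (off_t)(z.start + off));
}
```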

Optimization Techniques for the Intel MIC Architecture. Part 2 of 3: Strip-Mining for Vectorization

June 26, 2015

This is part 2 of a 3-part educational series of publications introducing select topics on optimization of applications for Intel’s multi-core and manycore architectures (Intel Xeon processors and Intel Xeon Phi coprocessors). In this paper we discuss data parallelism. Our focus is automatic vectorization and exposing vectorization opportunities to the compiler. For a practical illustration, we construct and optimize a micro-kernel for binning particles based on their coordinates. Similar workloads occur in Monte Carlo simulations, particle physics software, and statistical analysis. The optimization technique discussed in this paper leads to code vectorization, which results in an order of magnitude performance improvement on an Intel Xeon processor. Performance on an Intel Xeon Phi coprocessor is 1.4x greater in single precision and 1.6x greater in double precision than on a high-end Intel Xeon processor. See also: Part 1: Multi-Threading and Parallel Reduction Part 2: Strip-Mining for Vectorization Part 3: False Sharing and Padding Complete paper:  Colfax_Optimization_Techniques_2_of_3.pdf (650 KB) [...]
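
A minimal sketch of the strip-mining transformation: the particle loop is split into short strips so that the bin-index computation can be auto-vectorized, while the scatter into the bin counters, which inhibits vectorization, stays in a separate scalar loop. The strip length and the OpenMP SIMD pragma are illustrative choices, not the tuned values or pragmas from the paper.

```cpp
// Strip-mining a binning loop to expose vectorization to the compiler.
#include <cstddef>

constexpr std::size_t stripLen = 16;   // illustrative strip length

void binParticles(const float* x, std::size_t n,
                  float xMin, float binWidth, int* bins) {
    int idx[stripLen];
    for (std::size_t s = 0; s < n; s += stripLen) {
        const std::size_t len = (n - s < stripLen) ? (n - s) : stripLen;

        // Vectorizable strip: pure arithmetic, unit-stride loads, no dependences.
        #pragma omp simd
        for (std::size_t i = 0; i < len; ++i)
            idx[i] = (int)((x[s + i] - xMin) / binWidth);

        // Scalar strip: the indirect increment would otherwise block vectorization.
        for (std::size_t i = 0; i < len; ++i)
            ++bins[idx[i]];
    }
}
```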

Optimization Techniques for the Intel MIC Architecture. Part 1 of 3: Multi-Threading and Parallel Reduction

May 29, 2015

This is part 1 of a 3-part educational series of publications introducing select topics on optimization of applications for the Intel multi-core and manycore architectures (Intel Xeon processors and Intel Xeon Phi coprocessors). In this paper we focus on thread parallelism and race conditions. We discuss the usage of mutexes in OpenMP to resolve race conditions. We also show how to implement efficient parallel reduction using thread-private storage and mutexes. For a practical illustration, we construct and optimize a micro-kernel for binning particles based on their coordinates. Workloads like this one occur in such applications as Monte Carlo simulations, particle physics software, and statistical analysis. The optimization technique discussed in this paper leads to a performance increase of 25x on a 24-core CPU and up to 100x on the MIC architecture compared to a single-threaded implementation on the same architectures. In the next publication of this series, we will demonstrate further optimization of this workload, focusing on vectorization. See also: Part 1: Multi-Threading [...]
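
A minimal sketch of the reduction pattern described here: each thread bins its particles into a thread-private histogram, and only the short merge step is protected by a mutex (an OpenMP critical section), so threads do not contend on every increment. Illustrative code under these assumptions, not the paper's tuned micro-kernel.

```cpp
// Parallel reduction with thread-private storage and an OpenMP critical section.
#include <omp.h>
#include <vector>
#include <cstddef>

std::vector<int> binParticles(const float* x, std::size_t n,
                              float xMin, float binWidth, std::size_t nBins) {
    std::vector<int> bins(nBins, 0);

    #pragma omp parallel
    {
        // Thread-private storage: no other thread ever touches this buffer.
        std::vector<int> localBins(nBins, 0);

        #pragma omp for nowait
        for (std::ptrdiff_t i = 0; i < (std::ptrdiff_t)n; ++i) {
            std::size_t b = (std::size_t)((x[i] - xMin) / binWidth);
            if (b < nBins) ++localBins[b];
        }

        // The race condition lives only here; the critical section serializes
        // a merge that costs O(nBins) per thread instead of O(n) in total.
        #pragma omp critical
        for (std::size_t b = 0; b < nBins; ++b)
            bins[b] += localBins[b];
    }
    return bins;
}
```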

Performance to Power and Performance to Cost Ratios with Intel Xeon Phi Coprocessors (and why 1x Acceleration May Be Enough)

January 27, 2015

This paper studies two performance metrics of systems enabled with Intel Xeon Phi coprocessors: the ratio of performance to consumed electrical power and the ratio of performance to purchasing system cost, both under the assumption of linear parallel scalability of the application. Performance to power values are measured for three workloads: a compute-bound workload (DGEMM), a memory bandwidth-bound workload (STREAM), and a latency-limited workload (small matrix LU decomposition). Performance to cost ratios are computed, using system configurations and prices available at Colfax International, as functions of the acceleration factor and of the number of coprocessors per system. The study considers hypothetical applications with acceleration factors from 0.35x to 2x. In all cases, systems with Intel Xeon Phi coprocessors yield better metrics than systems with only Intel Xeon processors. This holds even with an acceleration factor of 1x, as long as the application can be distributed between the CPU and the coprocessor. Complete paper:  Colfax_1x.pdf (321 KB) — this file is [...]
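
The 1x argument can be made concrete with a small numeric sketch: if the application scales linearly and the work is split between the CPU and the coprocessors in proportion to their speeds, system throughput is proportional to 1 + k·A, where A is the acceleration factor and k the number of coprocessors. The prices below are placeholders, not the Colfax configurations used in the paper.

```cpp
// Toy performance-to-cost calculation under the linear-scalability assumption.
#include <cstdio>

int main() {
    const double cpuPerf = 1.0;     // baseline CPU-only performance (arbitrary units)
    const double cpuCost = 7000.0;  // hypothetical CPU-only system price, USD
    const double phiCost = 2500.0;  // hypothetical price per coprocessor, USD
    const double accel   = 1.0;     // acceleration factor: coprocessor matches the CPU

    for (int k = 0; k <= 2; ++k) {  // number of coprocessors in the system
        const double perf = cpuPerf * (1.0 + k * accel);   // CPU + coprocessor share
        const double cost = cpuCost + k * phiCost;
        std::printf("%d coprocessor(s): perf = %.1f, perf/cost = %.3f per $1000\n",
                    k, perf, 1000.0 * perf / cost);
    }
    return 0;
}
```

With these example numbers the performance-to-cost ratio improves with each added coprocessor even though the acceleration factor is only 1x, because the work is shared between the CPU and the coprocessor rather than offloaded outright.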

Fine-Tuning Vectorization and Memory Traffic on Intel Xeon Phi Coprocessors: LU Decomposition of Small Matrices

January 27, 2015

Common techniques for fine-tuning the performance of automatically vectorized loops in applications for Intel Xeon Phi coprocessors are discussed. These techniques include strength reduction, regularizing the vectorization pattern, data alignment and aligned data hints, and pointer disambiguation. In addition, the loop tiling technique for memory traffic tuning is shown. The optimization methods are illustrated on an example of single-threaded LU decomposition of a single precision matrix of size 128×128. Benchmarks show that the discussed optimizations improve the performance on the coprocessor by a factor of 2.8 compared to the unoptimized code, and by a factor of 1.7 on the multi-core host system, achieving roughly the same performance on the host and on the coprocessor. The code discussed in the paper can be freely downloaded from this page. Complete paper:  Colfax_LU.pdf (604 KB) — this file is available only to registered users. Register or Log In. Source code for Linux: colfax-lu.tgz (17 KB) — this archive is available only to registered users. Register or [...]
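
A minimal sketch of a non-pivoted single-precision LU decomposition of the kind tuned in the paper, with three of the named techniques visible: strength reduction (one reciprocal instead of a division per element), pointer disambiguation via __restrict, and a vectorization hint on the unit-stride inner loop. The pragma and layout are illustrative assumptions; the paper's tuned kernel applies several further transformations, including loop tiling.

```cpp
// In-place LU (Doolittle, no pivoting) of a 128x128 single-precision matrix:
// on exit, A holds U in its upper triangle and the L multipliers below it.
#include <cstddef>

constexpr std::size_t n = 128;

void luDecompose(float* __restrict A) {          // pointer disambiguation
    for (std::size_t k = 0; k < n; ++k) {
        const float invPivot = 1.0f / A[k * n + k];   // strength reduction
        for (std::size_t i = k + 1; i < n; ++i) {
            const float lik = A[i * n + k] * invPivot;
            A[i * n + k] = lik;
            // Unit-stride row update; with A allocated on a 64-byte boundary,
            // an aligned-access hint further helps the compiler vectorize it.
            #pragma omp simd
            for (std::size_t j = k + 1; j < n; ++j)
                A[i * n + j] -= lik * A[k * n + j];
        }
    }
}
```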

Intel Cilk Plus for Complex Parallel Algorithms: “Enormous Fast Fourier Transforms” (EFFT) Library

September 18, 2014

In this paper we demonstrate the methodology for parallelizing the computation of large one-dimensional discrete fast Fourier transforms (DFFTs) on multi-core Intel Xeon processors. DFFT implementations based on the recursive Cooley-Tukey method have to control cache utilization, memory bandwidth, and vector hardware usage, and at the same time scale across multiple threads or compute nodes. Our method builds on the single-threaded Intel Math Kernel Library (MKL) implementation of DFFT and uses the Intel Cilk Plus framework for thread parallelism. We demonstrate the ability of Intel Cilk Plus to handle parallel recursion with nested loop-centric parallelism without tuning the code to the number of cores or cache metrics. The result of our work is a library called EFFT that performs 1D DFTs of size 2^N for N>=21 faster than the corresponding Intel MKL parallel DFT implementation by up to 1.5x, and faster than FFTW by up to 2.5x. The code of EFFT is available for free download under the GPLv3 license. This work provides a new efficient DFFT implementation, and at the same time demonstrates an [...]
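
A minimal sketch of the Cilk Plus pattern in question: recursive Cooley-Tukey splitting in which one half-size sub-transform is spawned and the butterfly combine pass is a parallel loop, so parallel recursion and loop-centric parallelism nest without tuning to the core count. This is a textbook radix-2 out-of-place FFT, not the EFFT code, and the recursion cutoff is an arbitrary example value; it requires a Cilk Plus-capable compiler such as the Intel C++ compiler of that era.

```cpp
// Recursive radix-2 DIT FFT with Cilk Plus parallel recursion (cilk_spawn)
// and nested loop-centric parallelism (cilk_for). Illustration only.
#include <cilk/cilk.h>
#include <complex>
#include <cmath>
#include <cstddef>

using cplx = std::complex<double>;

// Out-of-place FFT of length n (a power of two), reading the input with the
// given stride and writing a contiguous result.
void fft(const cplx* in, cplx* out, std::size_t n, std::size_t stride) {
    if (n == 1) { out[0] = in[0]; return; }

    // Recurse on even- and odd-indexed samples; spawn one branch so both halves
    // may run concurrently, staying serial below a cutoff where spawn overhead
    // would dominate (the 4096 threshold is an arbitrary example).
    if (n >= 4096) {
        cilk_spawn fft(in, out, n / 2, 2 * stride);
        fft(in + stride, out + n / 2, n / 2, 2 * stride);
        cilk_sync;
    } else {
        fft(in, out, n / 2, 2 * stride);
        fft(in + stride, out + n / 2, n / 2, 2 * stride);
    }

    // Loop-centric parallelism nested inside the recursion: the butterflies of
    // this level are independent of one another.
    cilk_for (std::size_t k = 0; k < n / 2; ++k) {
        const cplx w = std::polar(1.0, -2.0 * M_PI * (double)k / (double)n);
        const cplx even = out[k], odd = w * out[k + n / 2];
        out[k] = even + odd;
        out[k + n / 2] = even - odd;
    }
}
```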

Installing Intel MPSS 3.3 in Arch Linux

August 20, 2014

This technical publication provides instructions for installing the Intel Manycore Platform Software Stack (MPSS) version 3.3 in the Arch Linux operating system. Intel MPSS is a suite of tools necessary for the operation of Intel Xeon Phi coprocessors. The instructions provided here enable offload and networking functionality for coprocessors in Arch Linux. The procedure described in this paper is completely reversible via an uninstallation script.
Downloads:
Intel MPSS 3.3 (page, archive): mpss-3.3-linux.tar (~400 MB)
Linux Kernel 3.10 LTS (AUR): linux-lts310.tar.gz (78 KB)
TRee Installation Generator (TRIG): trig.sh (3 KB) — this archive is available only to registered users. Register or Log In.
RHEL networking utilities: rhnet.tgz (34 KB) — this archive is available only to registered users. Register or Log In.
Offload functionality test: Offload-Hello.cc (347 B) — this archive is available only to registered users. Register or Log In.
GNU Public License v2 (applies to TRIG and RHEL utilities): page
Paper:  Colfax_MPSS_in_Arch_Linux.pdf (97 KB) — [...]