Publications

Machine Learning on 2nd Generation Intel® Xeon Phi™ Processors: Image Captioning with NeuralTalk2, Torch

June 20, 2016

In this case study, we describe a proof-of-concept implementation of a highly optimized machine learning application for Intel Architecture. Our results demonstrate the capabilities of Intel Architecture, particularly the 2nd generation Intel Xeon Phi processors (formerly codenamed Knights Landing), in the machine learning domain.

Download as PDF: Colfax-NeuralTalk2-Summary.pdf (814 KB) or read online below.

Code: see our branch of NeuralTalk2 for instructions on reproducing our results (in Readme.md). It uses our optimized branch of Torch to run efficiently on Intel architecture.

See also: colfaxresearch.com/get-ready-for-intel-knights-landing-3-papers/

1. Case Study

It is common in the machine learning (ML) domain to see applications implemented with the use of frameworks and libraries such as Torch, Caffe, TensorFlow, and similar. This approach allows the computer scientist to focus on the learning algorithm, leaving the details of performance optimization to the framework. Similarly, the ML frameworks usually rely on a third-party library such as Atlas, CuBLAS, [...]

Intel® Python* on 2nd Generation Intel® Xeon Phi™ Processors: Out-of-the-Box Performance

June 20, 2016

This paper reports on the value and performance of the Intel® Distribution for Python* 2017 Beta for computational applications on 2nd generation Intel® Xeon Phi™ processors (formerly codenamed Knights Landing). We present benchmarks of the LU decomposition, Cholesky decomposition, singular value decomposition, and double precision general matrix-matrix multiplication (DGEMM) routines in the SciPy and NumPy libraries, and lay out a tuning methodology for use with high-bandwidth memory (HBM).

Download as PDF: Colfax-Intel-Python.pdf (1 MB) or read online below.

Code: coming soon, check back later.

See also: colfaxresearch.com/get-ready-for-intel-knights-landing-3-papers/

1. A Case for Python in Computing

Python is a popular scripting language in computational applications. Empowered with the NumPy and SciPy libraries, the fundamental tools for scientific computing, Python applications can express in brief and convenient form basic linear algebra subroutines (BLAS) and linear algebra package (LAPACK) functions for operations on matrices and systems of linear algebraic [...]
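As a rough illustration of the kind of measurement the abstract describes, the sketch below times three of the named routines with NumPy and converts the matrix-multiplication time to GFLOP/s. It is not the paper's benchmark code (which had not been released at the time of this listing): the helper names `time_call` and `run_benchmarks`, the matrix size, and the repeat count are assumptions made for this example. LU decomposition is left out because NumPy has no LU routine; `scipy.linalg.lu_factor` could be timed the same way.

```python
import time

import numpy as np


def time_call(fn, *args, repeats=3):
    """Return the best wall-clock time in seconds over several runs of fn(*args)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best


def run_benchmarks(n=256, seed=0):
    """Time matrix multiplication, Cholesky, and SVD on n-by-n float64 inputs."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n, n))
    b = rng.standard_normal((n, n))
    # Cholesky needs a symmetric positive definite matrix.
    spd = a @ a.T + n * np.eye(n)
    timings = {
        "dgemm": time_call(np.matmul, a, b),
        "cholesky": time_call(np.linalg.cholesky, spd),
        "svd": time_call(np.linalg.svd, a),
    }
    # A dense n-by-n matrix product performs about 2*n^3 floating-point operations.
    gflops = 2.0 * n**3 / max(timings["dgemm"], 1e-12) / 1e9
    return timings, gflops


if __name__ == "__main__":
    timings, gflops = run_benchmarks()
    for name, seconds in timings.items():
        print(f"{name:10s} {seconds:.6f} s")
    print(f"DGEMM rate: {gflops:.2f} GFLOP/s")
```

The numbers depend entirely on which BLAS/LAPACK backend the installed NumPy links against, which is the point of comparing Python distributions in the first place.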

Get Ready for Intel’s Knights Landing (KNL) – 3 papers

May 11, 2016

2nd generation Intel Xeon Phi processors code-named Knights Landing (KNL) are expected to provide up to 3X higher performance than the 1st generation. With on-board high-bandwidth memory and optional integrated high-speed fabric, plus the availability of a socket form factor, these powerful components will transform the fundamental building block of technical computing.

Download three essential publications on new features in Knights Landing Processors:

Automatic Vectorization with Intel AVX-512 Instructions in KNL

In this document, we focus on the new vector instruction set introduced in Knights Landing processors, Intel® Advanced Vector Extensions 512 (Intel® AVX-512). The discussion includes an introduction to vector instructions in general, the structure and specifics of AVX-512, and practical usage tips: checking if a processor has support for various features, the compilation process and compiler arguments, and the pros and cons of explicit and automatic vectorization using the Intel® C++ Compiler and the GNU Compiler Collection.

Download PDF Read Online

Clustering Modes in Knights [...]

Clustering Modes in Knights Landing Processors

May 11, 2016

This publication is part of a developer guide focusing on the new features in 2nd generation Intel® Xeon Phi™ processors code-named Knights Landing (KNL). In this document we discuss the clustering modes of the on-die mesh interconnect. We start with a discussion of what types of applications benefit from the clustering modes and why clustering modes help these applications. After that we cover the specifics of the available cluster modes: all-to-all, quadrant, hemisphere, SNC-4 and SNC-2. Finally, we discuss how to make the application NUMA-aware for use in SNC modes. In this context, we give recipes for nested OpenMP and hybrid MPI+OpenMP approaches combined with the first-touch allocation policy, the numactl tool, and the memkind library.

Colfax_KNL_Clustering_Modes_Guide.pdf (376 KB)

See also: colfaxresearch.com/get-ready-for-intel-knights-landing-3-papers/

Table of Contents 1. Cache Organization in KNL 2. Clustering Modes 2.1. All-to-All 2.2. Quadrant/Hemisphere 2.3. SNC-4/SNC-2 2.4. Setting the Clustering Mode 3. Programming with Sub-NUMA Clusters 3.1. Querying NUMA Information [...]

MCDRAM as High-Bandwidth Memory (HBM) in Knights Landing Processors: Developer’s Guide

May 11, 2016

This publication is part of a developer guide focusing on the new features in 2nd generation Intel® Xeon Phi™ processors code-named Knights Landing (KNL). In this document we discuss the on-package high-bandwidth memory (HBM) based on the multi-channel dynamic random access memory (MCDRAM) technology: the three configuration modes of HBM (Flat mode, Cache mode and Hybrid mode); utilization of the HBM as addressable memory using two methods, by setting affinity policy with the numactl tool and through the usage of special allocators in the memkind library; and guidelines for determining the optimal usage model for applications running on bootable Knights Landing.

Colfax_KNL_MCDRAM_Guide.pdf (255 KB)

See also: colfaxresearch.com/get-ready-for-intel-knights-landing-3-papers/

Table of Contents 1. MCDRAM in KNL 2.1. Cache Mode 2.2. Flat Mode 2.3. Hybrid Mode 3. Using HBM as addressable memory 3.1. numactl 3.2. Memkind Library 3.3. Fortran 4. Choosing Memory and Programming Model 4.1. Programming with HBM… 4.2. …and Programming without HBM Appendix A: Application Memory [...]

Guide to Automatic Vectorization with Intel AVX-512 Instructions in Knights Landing Processors

May 11, 2016

This publication is part of a developer guide focusing on the new features in 2nd generation Intel® Xeon Phi™ processors code-named Knights Landing (KNL). In this document, we focus on the new vector instruction set introduced in Knights Landing processors, Intel® Advanced Vector Extensions 512 (Intel® AVX-512). The discussion includes an introduction to vector instructions in general, the structure and specifics of AVX-512, and practical usage tips: checking if a processor has support for various features, the compilation process and compiler arguments, and the pros and cons of explicit and automatic vectorization using the Intel® C++ Compiler and the GNU Compiler Collection.

Colfax_KNL_AVX512_Guide.pdf

See also: colfaxresearch.com/get-ready-for-intel-knights-landing-3-papers/

Table of Contents 1. Vector Instructions 2. Structure and Functionality of AVX-512 2.1. Subsets 2.2. AVX512-F 2.3. AVX512-CD 2.4. AVX512-ER 2.5. AVX512-PF 3. Feature Check 3.1. Command Line 3.2. Source Code 4. Compiling 4.1. Usage Models 4.2. Intel C++ Compiler 4.3. The GNU [...]
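The feature check mentioned in the abstract can be sketched in a few lines. The example below is an assumption about one way to do it, not the guide's own recipe: on x86 Linux, `/proc/cpuinfo` lists each core's capabilities on a `flags` line, and the four AVX-512 subsets named in the table of contents appear there as `avx512f`, `avx512cd`, `avx512er`, and `avx512pf`. The function names `cpu_flags` and `avx512_support` are made up for this illustration.

```python
AVX512_SUBSETS = ("avx512f", "avx512cd", "avx512er", "avx512pf")


def cpu_flags(cpuinfo_text):
    """Collect the feature flags found on the 'flags' lines of /proc/cpuinfo text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            flags.update(value.split())
    return flags


def avx512_support(cpuinfo_text):
    """Map each AVX-512 subset name to True if its flag is present."""
    flags = cpu_flags(cpuinfo_text)
    return {name: name in flags for name in AVX512_SUBSETS}


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as handle:
            text = handle.read()
    except OSError:
        text = ""  # not Linux, or /proc is unavailable: report everything as absent
    for name, present in avx512_support(text).items():
        print(f"{name}: {'yes' if present else 'no'}")
```

Parsing is kept separate from file access so the same function can be applied to text captured from a remote machine.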

Guided Code Vectorization with Intel® Advisor XE

April 12, 2016

Early Stage Application Optimization made Easy with Step-by-Step Guide

In this publication we discuss the usage of an optimization tool called Intel® Advisor. The discussion is illustrated with an example workload that computes the electric potential at a set of points in 3-D space produced by a group of charged particles. The example workload runs on a multi-core Intel Xeon processor with Intel AVX2 instructions. The application was originally parallelized across cores, but otherwise neither optimized nor vectorized. In the publication, we discuss three performance issues that Intel Advisor detected: vector dependence, type conversion, and an inefficient memory access pattern. For each issue, we discuss how to interpret the data presented by Intel Advisor, and how to optimize the application to resolve the issue. After the optimization, we observed a 16x performance boost compared to the original, non-optimized implementation.

Complete paper: Colfax_Advisor_Vectorization.pdf (1 MB)

Sample code for Linux: Colfax_Advisor_Vectorization_Code.zip (50 [...]

Introduction to Intel DAAL, Part 2: Distributed Variance-Covariance Matrix Computation

March 28, 2016

This is part 2 of 3 of an introductory series of publications on the Intel® Data Analytics Acceleration Library (DAAL). DAAL is a data analytics library optimized for modern highly parallel computer architectures such as Intel Xeon and Intel Xeon Phi processors. The goal of this series is to provide developers with a technical overview for developing applications using DAAL. In part 1 of the series we discussed how to implement batch mode computation on a single node. In the present publication, we discuss the distributed mode computation. Our discussion focuses on both how and when to implement distributed mode computation with Intel DAAL. As an example workload, we implement an application that uses DAAL to compute a covariance matrix of a set of vectors. We first demonstrate how to use distributed mode with this example. Then, using this example application, we scan the parameter space to determine what parameter ranges benefit from distributed computation. We also demonstrate how the output of this computation may be used in image processing to compute the eigenvectors of [...]
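The idea behind computing a covariance matrix in distributed mode can be sketched without DAAL: each node reduces its own block of vectors to a small partial result, and a master step merges the partial results. The sketch below is plain Python and does not use the DAAL API; the function names `partial_result` and `merge` are hypothetical, and the blocks are processed in one process where a real application would exchange them between nodes (for example with MPI).

```python
def partial_result(block):
    """Local step: row count, column sums, and sum of outer products for one block."""
    dim = len(block[0])
    sums = [0.0] * dim
    cross = [[0.0] * dim for _ in range(dim)]
    for row in block:
        for i in range(dim):
            sums[i] += row[i]
            for j in range(dim):
                cross[i][j] += row[i] * row[j]
    return len(block), sums, cross


def merge(partials):
    """Master step: combine partial results into the sample covariance matrix."""
    partials = list(partials)
    dim = len(partials[0][1])
    n = sum(p[0] for p in partials)
    sums = [sum(p[1][i] for p in partials) for i in range(dim)]
    cross = [[sum(p[2][i][j] for p in partials) for j in range(dim)]
             for i in range(dim)]
    # cov[i][j] = (sum(x_i * x_j) - sum(x_i) * sum(x_j) / n) / (n - 1)
    return [[(cross[i][j] - sums[i] * sums[j] / n) / (n - 1) for j in range(dim)]
            for i in range(dim)]


if __name__ == "__main__":
    data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 9.0]]
    blocks = [data[:2], data[2:]]  # pretend each block lives on a different node
    for row in merge(partial_result(block) for block in blocks):
        print(row)
```

Only the count, a vector, and a small square matrix leave each node, so the communication cost depends on the vector dimension and not on the number of vectors. The sum-of-products formula used here is simple but can lose precision when the mean is large relative to the spread.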

Introduction to Intel DAAL, Part 1: Polynomial Regression with Batch Mode Computation

October 28, 2015

This is part 1 of 3 of an introductory series of publications on the Intel Data Analytics Acceleration Library (DAAL). DAAL is a data analytics library optimized for modern highly parallel computer architectures such as Intel Xeon and Intel Xeon Phi processors. The goal of this series is to provide developers with a technical overview for developing applications using DAAL. In this paper we focus on two aspects of developing an application with Intel DAAL: data management and computation. As a practical example, we implement a simple machine learning application with polynomial regression using the library in the batch computation mode. We demonstrate using this application for data-based prediction of hydrodynamic properties of yachts. The source code and data for the sample application are available for free download. The second and third parts of the series will discuss other aspects of data analysis with DAAL. In part 2, we discuss distributed data and computation in conjunction with MPI. In the third part, we discuss the case with multiple data sets and interfacing with a [...]
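For readers unfamiliar with the workload, polynomial regression in batch mode means fitting the coefficients from the whole data set in a single pass. The sketch below illustrates this for one input variable using the normal equations; it is plain Python, not the DAAL API, and the names `solve`, `fit_polynomial`, and `predict` are hypothetical. The paper's application presumably has several input features, which this simplified example does not cover.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_polynomial(xs, ys, degree):
    """Least-squares fit over the full data set; returns c with y ~ sum(c[p] * x**p)."""
    k = degree + 1
    # Normal equations (V^T V) c = V^T y, where V[i][p] = xs[i] ** p.
    vtv = [[sum(x ** (p + q) for x in xs) for q in range(k)] for p in range(k)]
    vty = [sum(y * x ** p for x, y in zip(xs, ys)) for p in range(k)]
    return solve(vtv, vty)


def predict(coeffs, x):
    """Evaluate the fitted polynomial at x."""
    return sum(c * x ** p for p, c in enumerate(coeffs))


if __name__ == "__main__":
    xs = [0.0, 1.0, 2.0, 3.0, 4.0]
    ys = [1.0 + 2.0 * x + 3.0 * x * x for x in xs]
    coeffs = fit_polynomial(xs, ys, degree=2)
    print(coeffs, predict(coeffs, 5.0))
```

The normal equations are easy to follow but become ill-conditioned for high degrees; a production library would typically use a more stable factorization.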

Optimization Techniques for the Intel MIC Architecture. Part 3 of 3: False Sharing and Padding

August 8, 2015

This is part 3 of a 3-part educational series of publications introducing select topics on optimization of applications for Intel’s multi-core and manycore architectures (Intel Xeon processors and Intel Xeon Phi coprocessors). In this paper we discuss false sharing, highlight the situations in which it may occur, and show how to eliminate it with the help of data container padding. For a practical illustration, we construct and optimize a micro-kernel for binning particles based on their coordinates. Similar workloads occur in Monte Carlo simulations, particle physics software, and statistical analysis. Results show that the impact of false sharing may be as high as an order of magnitude performance loss in a parallel application. On Intel Xeon processors, the padding required to eliminate false sharing is greater than on Intel Xeon Phi coprocessors, so target-specific padding values may be used in real-life applications.

See also: Part 1: Multi-Threading and Parallel Reduction; Part 2: Strip-Mining for Vectorization; Part 3: False Sharing and Padding

Complete paper: [...]
