Publications

FALCON Library: Fast Image Convolution in Neural Networks on Intel Architecture

November 9, 2016

We describe FALCON, an original open-source implementation of image convolution with a 3×3 filter based on Winograd’s minimal filtering algorithm. Compared to direct convolution, Winograd’s algorithm reduces the number of arithmetic operations at the cost of complicating the memory access pattern. This study is carried out in the context of image analysis in convolutional neural networks. Our implementation combines C language code with BLAS function calls for general matrix-matrix multiplication. The code is optimized for Intel Xeon Phi processors x200 (formerly Knights Landing) with Intel Math Kernel Library (MKL) used for BLAS call to the SGEMM function. To test the performance of FALCON in the context of machine learning, we benchmarked it for a set of image and filter sizes corresponding to the VGG Net architecture. In this test, FALCON achieves 10% greater overall performance than convolution from DNN primitives in Intel MKL. However, for some layers, FALCON is faster than MKL by 1.5x, but for other layers slower by as much as 4x. This indicates a possibility of a [...]

Machine Learning on 2nd Generation Intel® Xeon Phi™ Processors: Image Captioning with NeuralTalk2, Torch

June 20, 2016

  In this case study, we describe a proof-of-concept implementation of a highly optimized machine learning application for Intel Architecture. Our results demonstrate the capabilities of Intel Architecture, particularly the 2nd generation Intel Xeon Phi processors (formerly codenamed Knights Landing), in the machine learning domain. Download as PDF:  Colfax-NeuralTalk2-Summary.pdf (814 KB) — this file is available only to registered users. Register or Log In. or read online below. Code: see our branch of NeuralTalk2 for instructions on reproducing our results (in Readme.md). It uses our optimized branch of Torch to run efficiently on Intel architecture. See also: colfaxresearch.com/get-ready-for-intel-knights-landing-3-papers/ 1. Case Study It is common in the machine learning (ML) domain to see applications implemented with the use of frameworks and libraries such as Torch, Caffe, TensorFlow, and similar. This approach allows the computer scientist to focus on the learning algorithm, leaving the details of performance optimization to the framework. Similarly, the ML [...]

Intel® Python* on 2nd Generation Intel® Xeon Phi™ Processors: Out-of-the-Box Performance

June 20, 2016

This paper reports on the value and performance for computational applications of the Intel® distribution for Python* 2017 Beta on 2nd generation Intel® Xeon Phi™ processors (formerly codenamed Knights Landing). Benchmarks of LU decomposition, Cholesky decomposition, singular value decomposition and double precision general matrix-matrix multiplication routines in the SciPy and NumPy libraries are presented, and tuning methodology for use with high-bandwidth memory (HBM) is laid out. Download as PDF:  Colfax-Intel-Python.pdf (1 MB) — this file is available only to registered users. Register or Log In. or read online below. Code: coming soon, check back later. See also: colfaxresearch.com/get-ready-for-intel-knights-landing-3-papers/ 1. A Case for Python in Computing Python is a popular scripting language in computational applications. Empowered with the fundamental tools for scientific computing, NumPy and SciPy libraries, Python applications can express in brief and convenient form basic linear algebra subroutines (BLAS) and linear algebra package (LAPACK) [...]

Get Ready for Intel’s Knights Landing (KNL) – 3 papers

May 11, 2016

2nd generation Intel Xeon Phi processors code-named Knights Landing (KNL) are expected to provide up to 3X higher performance than the 1st generation. With on-board high-bandwidth memory and optional integrated high-speed fabric—plus the availability of socket form-factor — these powerful components will transform the fundamental building block of technical computing. Download three essential publications on new features in Knights Landing Processors: Automatic Vectorization with Intel AVX-512 Instructions in KNL In this document, we focus on the new vector instruction set introduced in Knights Landing processors, Intel® Advanced Vector Extensions 512 (Intel® AVX-512). The discussion includes: Introduction to vector instructions in general, The structure and specifics of AVX-512, and Practical usage tips: checking if a processor has support for various features, compilation process and compiler arguments, pros and cons of explicit and automatic vectorization, using the Intel® C++ Compiler and the GNU Compiler Collection. Download PDF Read Online Clustering Modes in Knights [...]

Clustering Modes in Knights Landing Processors

May 11, 2016

This publication is part of a developer guide focusing on the new features in 2nd generation Intel® Xeon Phi™ processors code-named Knights Landing (KNL). In this document we discuss the clustering modes of the on-die mesh interconnect. We start a discussion on what types of applications benefit from the clustering modes and why clustering modes help these applications. After that we cover the specifics of the available cluster modes: all-to-all, quadrant, hemisphere, SNC-4 and SNC-2. Finally, we discuss how to make the application NUMA-aware for use in SNC modes. In this context, we give recipes for nested OpenMP and hybrid MPI+OpenMP approaches combined with first-touch allocation policy, numactl tool and memkind library.  Colfax_KNL_Clustering_Modes_Guide.pdf (376 KB) — this file is available only to registered users. Register or Log In. See also: colfaxresearch.com/get-ready-for-intel-knights-landing-3-papers/ 1. Cache Organization in KNL 2nd generation Intel® Xeon Phi™processors code-named Knights Landing (KNL) are specialized [...]

MCDRAM as High-Bandwidth Memory (HBM) in Knights Landing Processors: Developer’s Guide

May 11, 2016

This publication is part of a developer guide focusing on the new features in 2nd generation Intel® Xeon Phi™ processors code-named Knights Landing (KNL). In this document we discuss the on-package high-bandwidth memory (HBM) based on the multi-channel dynamic random access memory (MCDRAM) technology: Three configuration modes of HBM: Flat mode, Cache mode and Hybrid mode Utilization of the HBM as addressable memory using two methods: by setting affinity policy with the numactl tool and through the usage of special allocators in the memkind library Guidelines for determining the optimal usage model for applications running on bootable Knights Landing.  Colfax_KNL_MCDRAM_Guide.pdf (255 KB) — this file is available only to registered users. Register or Log In. See also: colfaxresearch.com/get-ready-for-intel-knights-landing-3-papers/ 1. MCDRAM in KNL Memory bandwidth in computing systems is one of the common bottlenecks for performance in computational application. Bandwidth-limited applications are characterized by algorithms that have few floating point [...]

Guide to Automatic Vectorization with Intel AVX-512 Instructions in Knights Landing Processors

May 11, 2016

This publication is part of a developer guide focusing on the new features in 2nd generation Intel® Xeon Phi™processors code-named Knights Landing (KNL). In this document, we focus on the new vector instruction set introduced in Knights Landing processors, Intel® Advanced Vector Extensions 512 (Intel® AVX-512). The discussion includes: Introduction to vector instructions in general, The structure and specifics of AVX-512, and Practical usage tips: checking if a processor has support for various features, compilation process and compiler arguments, and pros and cons of explicit and automatic vectorization using the Intel® C++ Compiler and the GNU Compiler Collection.  Colfax_KNL_AVX512_Guide.pdf (195 KB) — this file is available only to registered users. Register or Log In. See also: colfaxresearch.com/get-ready-for-intel-knights-landing-3-papers/ 1. Vector Instructions Intel® Xeon Phi™products are highly parallel processors with the Intel® Many Integrated Core (MIC) architecture. Parallelism is present in these [...]

Guided Code Vectorization with Intel® Advisor XE

April 12, 2016

Early Stage Application Optimization made Easy with Step-by-Step Guide In this publication we discuss the usage of an optimization tool called Intel® Advisor. The discussion is illustrated with an example workload that computes the electric potential in a set of points in 3-D space produced by a group of charged particles. The example workload runs on a multi-core Intel Xeon processor with Intel AVX2 instructions. The application was originally parallelized across cores, but otherwise neither optimized nor vectorized. In the publication, we discuss three performance issues that the Intel Advisor detected: vector dependence, type conversion and inefficient memory access pattern. For each issue, we discuss how to interpret the data presented by the Intel Advisor, and also how to to optimize the application to resolve these issues. After the optimization, we observed a 16x performance boost compared to the original, non-optimized implementation. Complete paper:  Colfax_Advisor_Vectorization.pdf (1 MB) — this file is available only to registered users. Register or Log In. [...]

Introduction to Intel DAAL, Part 2: Distributed Variance-Covariance Matrix Computation

March 28, 2016

This is the part 2 of 3 of an introductory series of publications on the Intel® Data Analytics Acceleration Library (DAAL). DAAL is a data analytics library optimized for modern highly parallel computer architectures such as Intel Xeon and Intel Xeon Phi processors. The goal of this series is to provide developers a technical overview for developing applications using DAAL. In part 1 of the series we discussed how to implement batch mode computation on a single node. In the present publication, we discuss the distributed mode computation. Our discussion will focus both on how and when to implement distributed mode computation with Intel DAAL. As an example workload, we implement an application that uses DAAL to compute a covariance matrix of a set of vectors. We first demonstrate how to use distributed mode with this example. Then, using this example application, we scan the parameter space to determine what parameter ranges benefit from distributed computation. We also demonstrate how the output of this computation may be used in image processing to compute the eigenvectors of [...]

Introduction to Intel DAAL, Part 1: Polynomial Regression with Batch Mode Computation

October 28, 2015

This is the part 1 of 3 of an introductory series of publications on the Intel Data Analytics Acceleration Library (DAAL). DAAL is a data analytics library optimized for modern highly parallel computer architectures such as Intel Xeon and Intel Xeon Phi processors. The goal of this series is to provide developers a technical overview for developing applications using DAAL. In this paper we focus on two aspects of developing an application with Intel DAAL: data management and computation. As a practical example, we implement a simple machine learning application with polynomial regression using the library in the batch computation mode. We demonstrate using this application for data-based prediction of hydrodynamics properties of yachts. The source code and data for the sample application are available for free download. The second and third part of the series will discuss other aspects of data analysis with DAAL. In part 2, we discuss distributed data and computation in conjunction with MPI. In the third part, we discuss the case with multiple data sets and interfacing with a [...]
1 2 3 4