Publications

Canonical Stratification for Non-Mathematicians

October 17, 2018

Our recent publication “Algorithmic Canonical Stratifications of Simplicial Complexes” proposes a new algorithm for data analysis that offers a topology-aware path towards explainable artificial intelligence. Despite (or, perhaps, due to) being mathematically rigorous, the text of the original work is virtually impenetrable for readers not familiar with the concepts, tools, and notation of topology. In order to convey our ideas to a wider audience, we present this supplemental introduction. Here, we summarize and explain in plain English the motivation, reasoning, and methods of our new topological data analysis algorithm that we term “canonical stratification”.

Canonical-Stratification-for-Non-Mathematicians.pdf (38 KB)

Table of Contents
1. Motivation
2. More on Canonical Stratification
3. Conclusion

1. Motivation

Machine learning has advanced significantly in recent years and has proven itself to be a powerful and versatile tool in a variety of data-driven disciplines. Machine learning algorithms are now being used to make decisions in numerous areas [...]

How New QLC SATA SSDs Deliver 8x Faster Machine Learning

October 9, 2018

We report performance measurements of a Micron 5210 SSD in a Machine Learning workflow. Even though Machine Learning is highly CPU intensive, fast storage can lower training time through faster file pre-processing and serialization, particularly when the size of a data set exceeds the amount of installed memory. A popular format for data sets is TFRecord, and in our performance measurements, we compare the throughput and completion time of a TFRecord workload on a 7.68TB Micron 5210 ION SSD versus an 8TB Seagate 7200RPM HDD.

Colfax-Machine-Learning-and-QLC-SSDs.pdf (151 KB)

Table of Contents
1. QLC SSDs
2. Micron QLC SSD
3. Test System Configuration
4. Test Workload: TFRecord
5. Test Results
6. Summary

1. QLC SSDs

For years, 7200 RPM hard disk drives (HDDs) have been the standard media on which Machine Learning (ML) training data sets have been stored. These traditional HDDs have been preferred due to their low cost and easy-to-adopt SATA interfaces. However, HDDs suffer from relatively slow throughput. Solid State Drives (SSDs) have been too [...]

An optimization approach for agent-based computational models of biological development

April 9, 2018

Pablo Gonzalez-de-Aledo (a), Andrey Vladimirov (d), Marco Manca (b), Jerry Baugh (c), Ryo Asai (d), Marcus Kaiser (e,f), Roman Bauer (f,e)

(a) Software Performance Optimization Group, Imperial College London, London, United Kingdom
(b) CERN Openlab, IT Department, CERN, Switzerland
(c) Intel Corporation, USA
(d) Colfax International, USA
(e) Interdisciplinary Computing and Complex BioSystems Research Group, School of Computing, Newcastle University, Newcastle upon Tyne, United Kingdom
(f) Institute of Neuroscience, Newcastle University, Newcastle upon Tyne, United Kingdom

A paper led by Pablo Gonzalez-de-Aledo (Imperial College London) with contributions from his colleagues from CERN, Intel, Colfax and Newcastle University was published in the journal Advances in Engineering Software. This is a case study on performance optimization in a biological simulation code. The code presents a highly parallel implementation of a computer simulation that involves millions of agents interacting in a 3D environment. The paper explains the general approach to transforming a sequential code to run on modern, highly [...]

Optimization of Real-Time Object Detection on Intel® Xeon® Scalable Processors

November 11, 2017

This publication demonstrates the process of optimizing an object detection inference workload on an Intel® Xeon® Scalable processor using TensorFlow. This project pursues two objectives: (1) achieve object detection with real-time throughput (frame rate) and low latency, and (2) minimize the required computational resources. In this case study, a model described in the “You Only Look Once” (YOLO) project is used for object detection. The model consists of two components: a convolutional neural network and a post-processing pipeline. In this work, the original Darknet model is converted to a TensorFlow model. First, the convolutional neural network is optimized for inference. Then array programming with NumPy and TensorFlow is implemented for the post-processing pipeline. Finally, environment variables and other configuration parameters are tuned to maximize the computational performance. With these optimizations, real-time object detection is achieved while using a fraction of the available processing power of an Intel Xeon Scalable processor-based system. [...]

A Performance-Based Comparison of C/C++ Compilers

November 11, 2017

This paper reports a performance-based comparison of six state-of-the-art C/C++ compilers: AOCC, Clang, G++, Intel C++ compiler, PGC++, and Zapcc. We measure two aspects of the compilers’ performance: (1) the speed of compiled C/C++ code parallelized with OpenMP 4.x directives for multi-threading and vectorization, and (2) the compilation time for large projects with heavy C++ templating. In addition to measuring the performance, we interpret the results by examining the assembly instructions produced by each compiler. The tests are performed on an Intel Xeon Platinum processor featuring the Skylake architecture with AVX-512 vector instructions.

Colfax_Compiler_Comparison.pdf (562 KB)

Table of Contents
1. The Importance of a Good Compiler
2. Testing Methodology
2.1. Meet the Compilers
2.2. Target Architecture
2.3. Computational Kernels
2.4. Compilation Time
2.5. Test Details
2.6. Test Platform
2.7. Code Analysis
3. Results
3.1. Performance of Compiled Code
3.2. Compilation Speed
4. Summary
Appendix A. LU Decomposition
Appendix B. Jacobi Solver
Appendix C. Structure Function
Appendix [...]

A Survey and Benchmarks of Intel® Xeon® Gold and Platinum Processors

November 7, 2017

This paper provides quantitative guidelines and performance estimates for choosing a processor among the Platinum and Gold groups of the Intel Xeon Scalable family (formerly Skylake). The performance estimates are based on detailed technical specifications of the processors, including the efficiency of the Intel Turbo Boost technology. The achievable performance metrics are experimentally validated on several processor models with synthetic workloads. The best choice of the processor must take into account the nature of the application for which the processor is intended: multi-threading or multi-processing efficiency, support for vectorization, and dependence on memory bandwidth.

Colfax-Xeon-Scalable.pdf (334 KB)

Table of Contents
1. Which Xeon is Right for You?
2. CPU Comparison for Different Workloads
2.4. Bandwidth-Limited
3. Processor Choice Recommendations
4. Silver and Bronze Models
5. Large Memory, Integrated Fabric, Thermal Optimization

1. Which Xeon is Right for You?

In 2017, the Intel Xeon Scalable processor family was released, featuring the Skylake architecture. [...]

Capabilities of Intel® AVX-512 in Intel® Xeon® Scalable Processors (Skylake)

September 19, 2017

This paper reviews the Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instruction set and answers two critical questions: (1) How do Intel® Xeon® Scalable processors based on the Skylake architecture (2017) compare to their Broadwell-based predecessors as a result of AVX-512? (2) How are Intel Xeon processors based on Skylake different from their alternative, Intel® Xeon Phi™ processors with the Knights Landing architecture, which also feature AVX-512? We address these questions from the programmer’s perspective by demonstrating C language code of microkernels benefitting from AVX-512. For each example, we dig deeper and analyze the compilation practices, resultant assembly, and optimization reports. In addition to code studies, the paper contains performance measurements for a synthetic benchmark with guidelines on estimating peak performance. In conclusion, we outline the workloads and application domains that can benefit from the new features of AVX-512 instructions.

Colfax-SKL-AVX512-Guide.pdf (524 KB)

Table of Contents
1. Intel Advanced Vector Extensions 512 [...]
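As a rough illustration of how a theoretical peak-performance estimate is usually assembled, the sketch below multiplies the core count, the clock frequency, and the floating-point operations each core can retire per cycle. The function name, its parameters, and the example core count and frequency are hypothetical and are not taken from the paper. The 512-bit vector width and the convention that a fused multiply-add counts as two operations are general properties of AVX-512; the number of FMA units per core varies between processor models, so it is left as a parameter.

```python
def peak_gflops(cores, ghz, fma_units=2, vector_bits=512, element_bits=64):
    """Theoretical peak in GFLOP/s (hypothetical helper, for illustration only).

    cores        -- number of physical cores (assumed value supplied by the caller)
    ghz          -- sustained clock frequency in GHz under vector load (assumed)
    fma_units    -- FMA units per core; model-dependent, so treat as an assumption
    vector_bits  -- vector register width (512 for AVX-512)
    element_bits -- 64 for double precision, 32 for single precision
    """
    lanes = vector_bits // element_bits        # elements processed per vector instruction
    flops_per_cycle = fma_units * lanes * 2    # one FMA = one multiply + one add
    return cores * ghz * flops_per_cycle


# Hypothetical processor: 28 cores at 2.0 GHz with two FMA units per core.
print(peak_gflops(cores=28, ghz=2.0))                    # double precision
print(peak_gflops(cores=28, ghz=2.0, element_bits=32))   # single precision
```

The frequency that a processor sustains under heavy vector load can be lower than its nominal frequency, so the value passed as `ghz` largely determines how realistic the estimate is.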

Optimization of Hamerly’s K-Means Clustering Algorithm: CFXKMeans Library

July 21, 2017

This publication describes the application of performance optimization techniques to Hamerly’s K-means clustering algorithm. Starting with an unoptimized implementation of the algorithm, we discuss thread scheduling, reduction patterns, SIMD reduction, and unroll and jam. The presented optimizations aggregate to an 85.6x speedup compared to the original unoptimized implementation. The resulting implementation is packaged into a library named CFXKMeans with interfaces for C/C++ and Python. The Python interface is benchmarked using the MNIST 784 data set. The result for K=64 is compared to the performance of the K-means clustering implementation in a popular machine learning framework, scikit-learn, from the Intel distribution for Python. CFXKMeans performed our benchmark tests faster than scikit-learn by a factor of 4.68x on an Intel Xeon processor E5-2699 v4 and 5.54x on an Intel Xeon Phi 7250 processor. The CFXKMeans library is available under the MIT license at https://github.com/ColfaxResearch/CFXKMeans.

Colfax-Kmeans-Clustering-Optimization.pdf (365 KB) [...]

HPLinpack Benchmark on Intel Xeon Phi Processor Family x200 with Intel Omni-Path Fabric 100

July 10, 2017

We report the performance and a simplified tuning methodology of the HPLinpack benchmark on a cluster of Intel Xeon Phi processors 7250 with an Intel Omni-Path Fabric 100 Series interconnect. Our benchmarks are taken on the Colfax Cluster, a state-of-the-art computing resource open to the public for benchmarking and code validation. The paper provides recipes that may be used to reproduce our results in environments similar to this cluster.

Colfax-HPL-Intel-Xeon-Phi-x200-and-Intel-Omni-Path-100.pdf (130 KB)

Table of Contents
Section 1. HPLinpack Benchmark Algorithm HPL Configuration File
Section 2. System Configuration Intel Architecture Colfax Cluster
Section 3. Results Recipe Performance Impact of System Configuration
Section 4. Summary

Section 1. HPLinpack Benchmark

The HPLinpack benchmark generates and solves on distributed-memory computers a large dense system of linear algebraic equations with random coefficients. The benchmark exercises the floating-point arithmetic units, the memory subsystem, and the communication fabric. The result of the HPLinpack benchmark is based [...]
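To make the kind of problem described in this excerpt concrete, the following toy sketch generates a small dense system with random coefficients and solves it by Gaussian elimination with partial pivoting. It is a single-process, unoptimized illustration in pure Python, not the HPL implementation, which uses a blocked factorization distributed across many nodes. The function names, the matrix size, and the coefficient range are assumptions chosen for the example.

```python
import random


def solve_dense(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting (toy version)."""
    n = len(b)
    # Work on an augmented copy [a | b] so the inputs are left untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for k in range(n):
        # Partial pivoting: move the largest remaining entry of column k to row k.
        p = max(range(k, n), key=lambda i: abs(m[i][k]))
        m[k], m[p] = m[p], m[k]
        # Eliminate column k from all rows below the pivot row.
        for i in range(k + 1, n):
            f = m[i][k] / m[k][k]
            for j in range(k, n + 1):
                m[i][j] -= f * m[k][j]
    # Back substitution on the resulting upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def max_residual(a, x, b):
    """Largest absolute component of a x - b, used to check the solution."""
    return max(abs(sum(aij * xj for aij, xj in zip(row, x)) - bi)
               for row, bi in zip(a, b))


random.seed(0)   # fixed seed so the run is repeatable
n = 50           # tiny compared to benchmark-sized problems
a = [[random.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(n)]
b = [random.uniform(-0.5, 0.5) for _ in range(n)]
x = solve_dense(a, b)
print(max_residual(a, x, b))   # should be close to machine precision
```

Checking the residual after the solve mirrors the general idea of validating a computed solution; the specific acceptance criterion used by the benchmark is not reproduced here.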

FALCON Library: Fast Image Convolution in Neural Networks on Intel Architecture

November 9, 2016

We describe FALCON, an original open-source implementation of image convolution with a 3×3 filter based on Winograd’s minimal filtering algorithm. Compared to direct convolution, Winograd’s algorithm reduces the number of arithmetic operations at the cost of complicating the memory access pattern. This study is carried out in the context of image analysis in convolutional neural networks. Our implementation combines C language code with BLAS function calls for general matrix-matrix multiplication. The code is optimized for Intel Xeon Phi processors x200 (formerly Knights Landing) with Intel Math Kernel Library (MKL) used for BLAS calls to the SGEMM function. To test the performance of FALCON in the context of machine learning, we benchmarked it for a set of image and filter sizes corresponding to the VGG Net architecture. In this test, FALCON achieves 10% greater overall performance than convolution from DNN primitives in Intel MKL. However, FALCON is faster than MKL by 1.5x for some layers and slower by as much as 4x for others. This indicates a possibility of a [...]
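The trade of multiplications for additions that Winograd’s minimal filtering makes can be seen in its smallest one-dimensional case, usually written F(2,3): two outputs of a 3-tap filter computed with 4 multiplications instead of 6. The sketch below is a simplified illustration of that building block only. It does not reproduce FALCON’s 3×3 two-dimensional tiling, data layout, or SGEMM formulation, and the function names and sample values are hypothetical.

```python
def direct_f23(d, g):
    """Two outputs of a 3-tap filter over 4 inputs, computed directly (6 multiplications)."""
    return [d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
            d[1] * g[0] + d[2] * g[1] + d[3] * g[2]]


def winograd_f23(d, g):
    """The same two outputs using Winograd's F(2,3) with 4 multiplications."""
    # Filter transform: depends only on g, so it can be precomputed once per filter.
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) / 2
    u2 = (g[0] - g[1] + g[2]) / 2
    u3 = g[2]
    # Input transform (additions only) followed by 4 element-wise multiplications.
    m1 = (d[0] - d[2]) * u0
    m2 = (d[1] + d[2]) * u1
    m3 = (d[2] - d[1]) * u2
    m4 = (d[1] - d[3]) * u3
    # Output transform (additions only).
    return [m1 + m2 + m3, m2 - m3 - m4]


d = [1.0, 2.0, 3.0, 4.0]   # four consecutive input samples (made-up values)
g = [0.5, -1.0, 2.0]       # 3-tap filter (made-up values)
print(direct_f23(d, g))    # [4.5, 6.0]
print(winograd_f23(d, g))  # [4.5, 6.0]
```

The extra additions and the transformed intermediate values are where the more complicated memory access pattern comes from: inputs and filters have to be rearranged into transformed tiles before the multiplications can be batched.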
