Recent

Optimization of Real-Time Object Detection on Intel® Xeon® Scalable Processors

November 11, 2017

This publication demonstrates the process of optimizing an object detection inference workload on and Intel® Xeon® Scalable processor using TensorFlow. This project pursues two objectives: Achieve object detection with real-time throughput (frame rate) and low latency Minimize the required computational resources In this case study, a model described in the “You Only Look Once” (YOLO) project is used for object detection. The model consists of two components: a convolutional neural network and a post-processing pipeline. In this work, the original Darknet model is converted to a TensorFlow model. First, the convolutional neural network is optimized for inference. Then array programming with NumPy and TensorFlow is implemented for the post-processing pipeline. Finally, environment variables and other configuration parameters are tuned to maximize the computational performance. With these optimizations, real-time object detection is achieved while using a fraction of the available processing power of an Intel Xeon Scalable processor-based system. [...]

A Performance-Based Comparison of C/C++ Compilers

November 11, 2017

This paper reports a performance-based comparison of six state-of-the-art C/C++ compilers: AOCC, Clang, G++, Intel C++ compiler, PGC++, and Zapcc. We measure two aspects of the compilers’ performance: The speed of compiled C/C++ code parallelized with OpenMP 4.x directives for multi-threading and vectorization. The compilation time for large projects with heavy C++ templating. In addition to measuring the performance, we interpret the results by examining the assembly instructions produced by each compiler. The tests are performed on an Intel Xeon Platinum processor featuring the Skylake architecture with AVX-512 vector instructions.  Colfax_Compiler_Comparison.pdf (562 KB) — this file is available only to registered users. Register or Log In. Table of Contents 1. The Importance of a Good Compiler 2. Testing Methodology 2.1. Meet the Compilers 2.2. Target Architecture 2.3. Computational Kernels 2.4. Compilation Time 2.5. Test Details 2.6. Test Platform 2.7. Code Analysis 3. Results 3.1. Performance of Compiled Code 3.2. Compilation Speed 4. Summary Appendix A. LU [...]

A Survey and Benchmarks of Intel® Xeon® Gold and Platinum Processors

November 7, 2017

This paper provides quantitative guidelines and performance estimates for choosing a processor among the Platinum and Gold groups of the Intel Xeon Scalable family (formerly Skylake). The performance estimates are based on detailed technical specifications of the processors, including the efficiency of the Intel Turbo Boost technology. The achievable performance metrics are experimentally validated on several processor models with synthetic workloads. The best choice of the processor must take into account the nature of the application for which the processor is intended: multi-threading or multi-processing efficiency, support for vectorization, and dependence on memory bandwidth.  Colfax-Xeon-Scalable.pdf (334 KB) — this file is available only to registered users. Register or Log In. Table of Contents 1. Which Xeon is Right for You? 2. CPU Comparison for Different Workloads 3. Processor Choice Recommendations 4. Silver and Bronze Models 5. Large Memory, Integrated Fabric, Thermal Optimization Why are some sections grayed out? You are viewing an abridged version of this [...]

Capabilities of Intel® AVX-512 in Intel® Xeon® Scalable Processors (Skylake)

September 19, 2017

This paper reviews the Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instruction set and answers two critical questions: How do Intel® Xeon® Scalable processors based on the Skylake architecture (2017) compare to their predecessors based on Broadwell due to AVX-512? How are Intel Xeon processors based on Skylake different from their alternative, Intel® Xeon Phi™ processors with the Knights Landing architecture, which also feature AVX-512? We address these questions from the programmer’s perspective by demonstrating C language code of microkernels benefitting from AVX-512. For each example, we dig deeper and analyze the compilation practices, resultant assembly, and optimization reports. In addition to code studies, the paper contains performance measurements for a synthetic benchmark with guidelines on estimating peak performance. In conclusion, we outline the workloads and application domains that can benefit from the new features of AVX-512 instructions.  Colfax-SKL-AVX512-Guide.pdf (524 KB) — this file is available only to registered users. [...]

Optimization of Hamerly’s K-Means Clustering Algorithm: CFXKMeans Library

July 21, 2017

This publication describes the application of performance optimizations techniques to Hamerly’s K-means clustering algorithm. Starting with an unoptimized implementation of the algorithm, we discuss: Thread scheduling Reduction patterns SIMD reduction Unroll and jam Presented optimizations aggregate to 85.6x speedup compared to the original unoptimized implementation. Resulting implementation is packaged into a library named CFXKMeans with interfaces for C/C++ and Python. The Python interface is benchmarked using the MNIST 784 data set. The result for K=64 is compared to the performance of K-means clustering implementation in a popular machine learning framework, scikit-learn, from the Intel distribution for Python. CFXKMeans performed our benchmark tests faster than scikit-learn by a factor of 4.68x on an Intel Xeon processor E5-2699 v4 and 5.54x on an Intel Xeon Phi 7250 processor. The CFXKMeans library has C/C++ and Python API and is available under the MIT license at https://github.com/ColfaxResearch/CFXKMeans.  Colfax-Kmeans-Clustering-Optimization.pdf (365 KB) [...]

HPLinpack Benchmark on Intel Xeon Phi Processor Family x200 with Intel Omni-Path Fabric 100

July 10, 2017

We report the performance and a simplified tuning methodology of the HPLinpack benchmark on a cluster of Intel Xeon Phi processors 7250 with an Intel Omni-Path Fabric 100 Series interconnect. Our benchmarks are taken on the Colfax Cluster, a state-of-the-art computing resource open to the public for benchmarking and code validation. The paper provides recipes that may be used to reproduce our results in environments similar to this cluster.  Colfax-HPL-Intel-Xeon-Phi-x200-and-Intel-Omni-Path-100.pdf (130 KB) — this file is available only to registered users. Register or Log In. Table of Contents Section 1. HPLinpack Benchmark Algorithm HPL Configuration File Section 2. System Configuration Intel Architecture Colfax Cluster Section 3. Results Recipe Performance Impact of System Configuration Section 4. Summary Section 1. HPLinpack Benchmark The HPLinpack benchmark generates and solves on distributed-memory computers a large dense system of linear algebraic equations with random coefficients. The benchmark exercises the floating-point arithmetic units, the memory subsystem, [...]