A Performance-Based Comparison of C/C++ Compilers

November 11, 2017

This paper reports a performance-based comparison of six state-of-the-art C/C++ compilers: AOCC, Clang, G++, Intel C++ compiler, PGC++, and Zapcc. We measure two aspects of the compilers’ performance: The speed of compiled C/C++ code parallelized with OpenMP 4.x directives for multi-threading and vectorization. The compilation time for large projects with heavy C++ templating. In addition to measuring the performance, we interpret the results by examining the assembly instructions produced by each compiler. The tests are performed on an Intel Xeon Platinum processor featuring the Skylake architecture with AVX-512 vector instructions.  Colfax_Compiler_Comparison.pdf (562 KB) — this file is available only to registered users. Register or Log In. Table of Contents 1. The Importance of a Good Compiler 2. Testing Methodology 2.1. Meet the Compilers 2.2. Target Architecture 2.3. Computational Kernels 2.4. Compilation Time 2.5. Test Details 2.6. Test Platform 2.7. Code Analysis 3. Results 3.1. Performance of Compiled Code 3.2. Compilation Speed 4. Summary Appendix A. LU [...]

A Survey and Benchmarks of Intel® Xeon® Gold and Platinum Processors

November 7, 2017

This paper provides quantitative guidelines and performance estimates for choosing a processor among the Platinum and Gold groups of the Intel Xeon Scalable family (formerly Skylake). The performance estimates are based on detailed technical specifications of the processors, including the efficiency of the Intel Turbo Boost technology. The achievable performance metrics are experimentally validated on several processor models with synthetic workloads. The best choice of the processor must take into account the nature of the application for which the processor is intended: multi-threading or multi-processing efficiency, support for vectorization, and dependence on memory bandwidth.  Colfax-Xeon-Scalable.pdf (334 KB) — this file is available only to registered users. Register or Log In. Table of Contents 1. Which Xeon is Right for You? 2. CPU Comparison for Different Workloads 3. Processor Choice Recommendations 4. Silver and Bronze Models 5. Large Memory, Integrated Fabric, Thermal Optimization Why are some sections grayed out? You are viewing an abridged version of this [...]

Capabilities of Intel® AVX-512 in Intel® Xeon® Scalable Processors (Skylake)

September 19, 2017

This paper reviews the Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instruction set and answers two critical questions: How do Intel® Xeon® Scalable processors based on the Skylake architecture (2017) compare to their predecessors based on Broadwell due to AVX-512? How are Intel Xeon processors based on Skylake different from their alternative, Intel® Xeon Phi™ processors with the Knights Landing architecture, which also feature AVX-512? We address these questions from the programmer’s perspective by demonstrating C language code of microkernels benefitting from AVX-512. For each example, we dig deeper and analyze the compilation practices, resultant assembly, and optimization reports. In addition to code studies, the paper contains performance measurements for a synthetic benchmark with guidelines on estimating peak performance. In conclusion, we outline the workloads and application domains that can benefit from the new features of AVX-512 instructions.  Colfax-SKL-AVX512-Guide.pdf (524 KB) — this file is available only to registered users. [...]

HPLinpack Benchmark on Intel Xeon Phi Processor Family x200 with Intel Omni-Path Fabric 100

July 10, 2017

We report the performance and a simplified tuning methodology of the HPLinpack benchmark on a cluster of Intel Xeon Phi processors 7250 with an Intel Omni-Path Fabric 100 Series interconnect. Our benchmarks are taken on the Colfax Cluster, a state-of-the-art computing resource open to the public for benchmarking and code validation. The paper provides recipes that may be used to reproduce our results in environments similar to this cluster.  Colfax-HPL-Intel-Xeon-Phi-x200-and-Intel-Omni-Path-100.pdf (130 KB) — this file is available only to registered users. Register or Log In. Table of Contents Section 1. HPLinpack Benchmark Algorithm HPL Configuration File Section 2. System Configuration Intel Architecture Colfax Cluster Section 3. Results Recipe Performance Impact of System Configuration Section 4. Summary Section 1. HPLinpack Benchmark The HPLinpack benchmark generates and solves on distributed-memory computers a large dense system of linear algebraic equations with random coefficients. The benchmark exercises the floating-point arithmetic units, the memory subsystem, [...]

Intel® Python* on 2nd Generation Intel® Xeon Phi™ Processors: Out-of-the-Box Performance

June 20, 2016

This paper reports on the value and performance for computational applications of the Intel® distribution for Python* 2017 Beta on 2nd generation Intel® Xeon Phi™ processors (formerly codenamed Knights Landing). Benchmarks of LU decomposition, Cholesky decomposition, singular value decomposition and double precision general matrix-matrix multiplication routines in the SciPy and NumPy libraries are presented, and tuning methodology for use with high-bandwidth memory (HBM) is laid out. Download as PDF:  Colfax-Intel-Python.pdf (1 MB) — this file is available only to registered users. Register or Log In. or read online below. Code: coming soon, check back later. See also: 1. A Case for Python in Computing Python is a popular scripting language in computational applications. Empowered with the fundamental tools for scientific computing, NumPy and SciPy libraries, Python applications can express in brief and convenient form basic linear algebra subroutines (BLAS) and linear algebra package (LAPACK) [...]

Software Developer’s Introduction to the HGST Ultrastar Archive Ha10 SMR Drives

July 31, 2015

In this paper we will discuss the new HGST Shingled Magnetic Recording (SMR) drives, Ultrastar Archive Ha10, which offers storage capacities of 10 TB and beyond. With their high-density storage capacities, these drives are well suited for large “active archive” applications. In an active archive application, the data is frequently read but seldom modified. The SMR drives are host managed, meaning that the developer must manage the data storage on the drives. In this publication we introduce an open source library, libzbc, which was developed by the HGST team to assist developers who use SMR drives. The discussions cover topics from the very basics like opening a device, to more advanced topics like data padding. The goal of this paper is to give readers the necessary knowledge and tools to develop applications with libzbc. We will present an example, and then report several benchmarks of I/O operations on the HGST SMR drives, and discuss the SMR drive’s effectiveness as an active archive solution. Complete paper:  HGST_Introduction_to_libzbc.pdf (361 KB) — this [...]

Performance to Power and Performance to Cost Ratios with Intel Xeon Phi Coprocessors (and why 1x Acceleration May Be Enough)

January 27, 2015

The paper studies two performance metrics of systems enabled with Intel Xeon Phi coprocessors: the ratio of performance to consumed electrical power and the ratio of performance to purchasing system cost, both under the assumption of linear parallel scalability of the application. Performance to power values are measured for three workloads: a compute-bound workload (DGEMM), a memory bandwidth-bound workload (STREAM), and a latency-limited workload (small matrix LU decomposition). Performance to cost ratios are computed, using system configurations and prices available at Colfax International, as functions of the acceleration factor and of the number of coprocessors per system. That study considers hypothetical applications with acceleration factor from 0.35x to 2x. In all studies, systems with Intel Xeon Phi coprocessors yield better metrics than systems with only Intel Xeon processors. That applies even with acceleration factor of 1x, as long as the application can be distributed between the CPU and the coprocessor. Complete paper:  Colfax_1x.pdf (321 KB) — this file is [...]

File I/O on Intel Xeon Phi Coprocessors: RAM disks, VirtIO, NFS and Lustre

July 28, 2014

The key innovation brought about by Intel Xeon Phi coprocessors is the possibility to port most HPC applications to manycore computing accelerators without code modification. One of the reasons why this is possible is support for file input/output (I/O) directly from applications running on coprocessors. These facilities allow seamless usage of manycore accelerators in common HPC tasks such as application initialization from file data, saving running output, checkpointing and restarting, data post-processing and visualization, and other. This paper provides information and benchmarks necessary to make the choice of the best file system for a given application from a number of the available options: RAM disks, virtualized local hard drives, and distributed storage shared with NFS or Lustre. We report benchmarks of I/O performance and parallel scalability on Intel Xeon Phi coprocessors, strengths and limitations of each option. In addition, the paper presents system administration procedures necessary for using each file system on coprocessors, including bridged networking and [...]

How to Write Your Own Blazingly Fast Library of Special Functions for Intel Xeon Phi Coprocessors

May 3, 2013

Statically-linked libraries are used in business and academia for security, encapsulation, and convenience reasons. Static libraries with functions offloadable to Intel Xeon Phi coprocessors must contain executable code for both the host and the coprocessor architecture. Furthermore, for library functions used in data-parallel contexts, vectorized versions of the functions must be produced at the compilation stage. This white paper shows how to design and build statically-linked libraries with functions offloadable to Intel Xeon Phi coprocessors. In addition, it illustrates how special functions with scalar syntax (e.g., y=f(x)) can be implemented in such a way that user applications can use them in thread- and data-parallel contexts. The second part of the paper demonstrates some optimization methods that improve the performance of functions with scalar syntax on the multi-core and the many-core platforms: precision control, strength reduction, and algorithmic optimizations. Complete paper:  Colfax_Static_Libraries_Xeon_Phi.pdf (426 KB) — this file is available only to [...]

Arithmetics on Intel’s Sandy Bridge and Westmere CPUs: not all FLOPs are created equal

April 30, 2012

This paper presents a new arithmetic efficiency benchmark and uses it to compare the Intel Sandy Bridge E5-2680 CPU to the Intel Westmere X5690 CPU performance. The efficiency is measured for single and double precision floating point operations: addition, multiplication, division, square root and the exponential function, and for 32- and 64-bit integer operations: addition, multiplication and division. The SSE2 and AVX instruction sets, as well as scalar operations, in single-threaded and multi-threaded modes are covered. This benchmark eliminates the effects of memory bandwidth and latency by fitting the calculation in the L1 cache. The bandwidth of the L1 cache and main memory (RAM) are estimated for reference, and the LINPACK benchmark result is reported. Results show that the E5-2680 CPU performs floating point addition and multiplication dramatically faster (up to 2.6x) than the X5690 model. However, the floating point division and square root are the new model’s weak spots. AVX floating point operations addition and multiplication are up to 2.0x faster than the SSE2; [...]
1 2