Articles by Andrey

A Performance-Based Comparison of C/C++ Compilers

November 11, 2017

This paper reports a performance-based comparison of six state-of-the-art C/C++ compilers: AOCC, Clang, G++, Intel C++ compiler, PGC++, and Zapcc. We measure two aspects of the compilers’ performance: (1) the speed of compiled C/C++ code parallelized with OpenMP 4.x directives for multi-threading and vectorization, and (2) the compilation time for large projects with heavy C++ templating. In addition to measuring the performance, we interpret the results by examining the assembly instructions produced by each compiler. The tests are performed on an Intel Xeon Platinum processor featuring the Skylake architecture with AVX-512 vector instructions. Colfax_Compiler_Comparison.pdf (562 KB). 1. The Importance of a Good Compiler. Modern x86-64 CPUs are highly complex CISC architecture machines. Modern vector extensions to the x86-64 architecture, such as AVX2 and AVX-512, have instructions developed to handle common computational kernels. For example, the fused multiply-add instruction is used to increase the [...]

A Survey and Benchmarks of Intel® Xeon® Gold and Platinum Processors

November 7, 2017

This paper provides quantitative guidelines and performance estimates for choosing a processor among the Platinum and Gold groups of the Intel Xeon Scalable family (formerly Skylake). The performance estimates are based on detailed technical specifications of the processors, including the efficiency of the Intel Turbo Boost technology. The achievable performance metrics are experimentally validated on several processor models with synthetic workloads. The best choice of processor must take into account the nature of the application for which the processor is intended: multi-threading or multi-processing efficiency, support for vectorization, and dependence on memory bandwidth. Printable (PDF): Colfax-Xeon-Scalable.pdf (334 KB). 1. Which Xeon is Right for You? In 2017, the Intel Xeon Scalable processor family was released, featuring the Skylake architecture. Processors in the Scalable family support Intel Advanced Vector Extensions 512 (Intel AVX-512) (see, e.g., this paper), improved cache and [...]

Stanford AI Hackathon: Cluster Request

September 20, 2017

To support the computing needs of this hackathon’s participants, Intel® Nervana™ AI Academy is making computing resources available via Remote Access. Please fill out and submit the form below to request access to the Colfax Cluster. You will get additional instructions via the email address that you provide. Do you have a free account at Colfax Research? Save time by logging in – we will fill in the fields with data from your profile. What’s Inside: When you get access, you will log in to a Linux-based head node of a batch farm. There you can stage your code and data, compile, and submit calculations to a queue. Once the queued job completes, your results will be in your home folder. Your account is active until Nov 5th 11:59 PDT. Jobs are scheduled on Intel® Xeon Phi™ processors 7210 (formerly Knights Landing). Each processor has 64 cores with 4-way hyper-threading. Each processor has access to 96 GiB of on-platform RAM (DDR4) and 16 GiB of high-bandwidth memory (MCDRAM). Only one job will run on any processor at a time. You will get 200 GB of file [...]

HOW Series “Deep Dive”: Webinars on Performance Optimization – 2017 Edition

June 30, 2017

In a Nutshell: HOW Series “Deep Dive” is a free Web-based training on parallel programming and performance optimization on Intel architecture. The workshop includes 20 hours of instruction and up to 2 weeks of remote access to dedicated training servers for hands-on exercises. This training is free to everyone thanks to Intel’s sponsorship. You can get trained in one of two ways. Self-paced (start right now): you can access the video recordings of lectures, slides of presentations, and code of practical exercises on this page using a free Colfax Research account. This option is free and open to everyone; however, self-paced study does not give you the benefits that you get by joining a workshop (which is also free, but tied to specific dates). Upcoming Workshops (the 10-day workshops occurring in different months have the same agenda): October 2017 [...]

Webinar: Demystifying Vectorization

May 18, 2017

Free Webinar. Abstract: Have you heard of code vectorization, but are not sure how it applies to your work? Rest assured, you are in good company. Furthermore, even seasoned computing professionals have a good excuse for not being familiar with this concept! That said, now is a great time to learn about writing vectorized code. That is because in modern Intel processors, vector instructions may speed up arithmetic operations by up to a factor of 16. However, you must design computational code in a way that makes vector processing possible. In this 1-hour webinar I will explain what to expect from vectorization, and how to make sure that your code has it: manual and compiler-assisted vectorization; assessing your success with vectorization; loop was vectorized – what’s next? Speaker: Andrey Vladimirov, Head of HPC Research, Colfax International. Dr. Andrey Vladimirov’s primary research interest is the application of modern computing technologies to computationally demanding scientific problems. Prior to joining Colfax, Andrey was involved in theoretical astrophysics [...]

Get the Most out of Your Free Trial of Intel Xeon Phi Processors

April 7, 2017

Free Webinar. Abstract: Intel® Xeon Phi™ processors x200 (formerly Knights Landing) are computational beasts. Their theoretical peak performance is up to 3 TFLOP/s and measured memory bandwidth is up to 490 GB/s. This performance is available without any difference in programming models compared to general-purpose x86-like CPUs. Colfax is offering a free trial program for this technology. This program is available through Intel’s sponsorship. The Colfax Cluster has 64 compute nodes based on Intel Xeon Phi 7250 processors. Intel® Omni-Path fabric interconnects the nodes. This cluster is at your service for two weeks for testing and evaluation. In this 1-hour webinar I will describe how you can get the most out of your two weeks on the cluster: what workloads you can run to see the performance; how to prepare your own code to run on the cluster; where to learn the best optimization practices for this and similar architectures. Slides: Colfax-Remote-Access-Webinar-2017.pdf (2 MB). Free trial: here [...]

MC² Series: Modern Code Contributed Talks

February 10, 2017

In Modern Code Contributed Talks, or MC² Series, experts in computational disciplines share their experience. Register for these ongoing webinars to learn the performance optimization methods used in real-life applications. Would you like to contribute a talk? Contact us. Scholarship is available in the form of access to a diverse collection of powerful computing [...]

Modern Code for Intel Xeon Phi Processors

December 8, 2016

This series of 45-minute webinars was presented by Colfax International in collaboration with Intel in 2016. 1. Strategies for Multi-Threading on Intel Xeon Phi Processors. Practical recipes for optimizing performance in multi-threaded computational applications on Intel Xeon Phi processors. The presentation covers common issues with thread parallelism (excessive synchronization, false sharing, and insufficient iteration space size) and methods for overcoming these issues: parallel reduction, data padding, strip-mining and loop collapse, and nested parallelism. Recording (45 min): this webinar aired September 28, 2016. Slides: Colfax_Modern_Code_Webinar_01.pdf (5 MB). 2. Fine-Tuning Vectorization on Intel Xeon Phi Processors. Vectorization of computational applications on Intel Xeon Phi processors. Covers automatic vectorization essentials and the toolkit for advanced tuning of vectorization performance, including compiler directives, data [...]

Clustering Modes in Knights Landing Processors

May 11, 2016

This publication is part of a developer guide focusing on the new features in 2nd generation Intel® Xeon Phi™ processors code-named Knights Landing (KNL). In this document we discuss the clustering modes of the on-die mesh interconnect. We start with a discussion of what types of applications benefit from the clustering modes and why clustering modes help these applications. After that we cover the specifics of the available clustering modes: all-to-all, quadrant, hemisphere, SNC-4 and SNC-2. Finally, we discuss how to make the application NUMA-aware for use in SNC modes. In this context, we give recipes for nested OpenMP and hybrid MPI+OpenMP approaches combined with the first-touch allocation policy, the numactl tool, and the memkind library. Colfax_KNL_Clustering_Modes_Guide.pdf (376 KB). See also: colfaxresearch.com/get-ready-for-intel-knights-landing-3-papers/ 1. Cache Organization in KNL. 2nd generation Intel® Xeon Phi™ processors code-named Knights Landing (KNL) are specialized [...]

Optimization Techniques for the Intel MIC Architecture. Part 3 of 3: False Sharing and Padding

August 8, 2015

This is part 3 of a 3-part educational series of publications introducing select topics on optimization of applications for Intel’s multi-core and manycore architectures (Intel Xeon processors and Intel Xeon Phi coprocessors). In this paper we discuss false sharing, highlighting the situations in which it may occur, and eliminating it with the help of data container padding. For a practical illustration, we construct and optimize a micro-kernel for binning particles based on their coordinates. Similar workloads occur in Monte Carlo simulations, particle physics software, and statistical analysis. Results show that the impact of false sharing may be as high as an order of magnitude performance loss in a parallel application. On Intel Xeon processors, padding required to eliminate false sharing is greater than on Intel Xeon Phi coprocessors, so target-specific padding values may be used in real-life applications. See also: Part 1: Multi-Threading and Parallel Reduction; Part 2: Strip-Mining for Vectorization; Part 3: False Sharing and Padding. Complete paper: [...]
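To make the idea of padding against false sharing concrete, here is a minimal C++ sketch of a particle-binning kernel with per-worker padded counters. It is not the code from the paper: the names `PaddedBins`, `bin_particles`, `kBins`, and `kCacheLine` are hypothetical, and the 64-byte padding is an assumption typical of x86-64 cache lines (the excerpt above notes that the best padding value is target-specific).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption: 64-byte cache lines. Tune this value for the target processor.
constexpr std::size_t kCacheLine = 64;
constexpr int kBins = 4;  // hypothetical number of histogram bins

// Per-worker counters, aligned and padded to a whole number of cache lines
// so that two workers never update counters in the same line.
struct alignas(kCacheLine) PaddedBins {
    long count[kBins] = {};
};
static_assert(sizeof(PaddedBins) % kCacheLine == 0,
              "each worker's counters must fill whole cache lines");

// Bins coordinates x[i], assumed to lie in [0, 1), into kBins equal bins.
// Each worker accumulates into its own padded block; the blocks are then
// summed serially (a simple form of parallel reduction).
std::vector<long> bin_particles(const std::vector<double>& x, int nWorkers) {
    assert(nWorkers > 0);
    std::vector<PaddedBins> local(static_cast<std::size_t>(nWorkers));
    const std::size_t n = x.size();
    const std::size_t w = static_cast<std::size_t>(nWorkers);

    // With OpenMP enabled, each iteration of this loop runs on its own thread;
    // without it, the pragma is ignored and the workers run one after another.
    #pragma omp parallel for
    for (int t = 0; t < nWorkers; ++t) {
        const std::size_t tt = static_cast<std::size_t>(t);
        const std::size_t begin = n * tt / w;
        const std::size_t end = n * (tt + 1) / w;
        for (std::size_t i = begin; i < end; ++i) {
            int b = static_cast<int>(x[i] * kBins);
            if (b < 0) b = 0;                    // clamp out-of-range input
            if (b >= kBins) b = kBins - 1;
            ++local[tt].count[b];
        }
    }

    std::vector<long> total(kBins, 0L);
    for (const PaddedBins& p : local)
        for (int b = 0; b < kBins; ++b)
            total[static_cast<std::size_t>(b)] += p.count[b];
    return total;
}
```

The result is the same with or without OpenMP; compiling with an OpenMP flag (for example `-fopenmp` with GCC or Clang) only changes whether the workers run concurrently. Building as C++17 or later is advisable, because earlier standards do not guarantee that `std::vector` honors the over-aligned `alignas` request when allocating.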