Articles by Andrey

HOW Series “Deep Dive”: Webinars on Performance Optimization – 2017 Edition

June 30, 2017

In a Nutshell: HOW Series “Deep Dive” is a free Web-based training on parallel programming and performance optimization on Intel architecture. The workshop includes 20 hours of instruction and up to 2 weeks of remote access to dedicated training servers for hands-on exercises. This training is free to everyone thanks to Intel’s sponsorship. You can get trained in one of two ways. Self-paced (start right now): you can access the video recordings of lectures, the presentation slides, and the code of the practical exercises on this page using a free Colfax Research account. This option is free and open to everyone; however, self-paced study does not give you the benefits of joining a workshop (which is also free, but tied to specific dates). Upcoming workshops: the 10-day workshops occurring in different months have the same agenda. September 2017 [...]

Webinar: Demystifying Vectorization

May 18, 2017

Free Webinar. Abstract: Have you heard of code vectorization but are not sure how it applies to your work? Rest assured, you are in good company. Furthermore, even seasoned computing professionals have a good excuse for not being familiar with this concept! That said, now is a great time to learn about writing vectorized code, because in modern Intel processors vector instructions may speed up arithmetic operations by up to a factor of 16. However, you must design computational code in a way that makes vector processing possible. In this 1-hour webinar I will explain what to expect from vectorization and how to make sure that your code has it: manual and compiler-assisted vectorization; assessing your success with vectorization; loop was vectorized, what’s next? Speaker: Andrey Vladimirov, Head of HPC Research, Colfax International. Dr. Andrey Vladimirov’s primary research interest is the application of modern computing technologies to computationally demanding scientific problems. Prior to joining Colfax, Andrey was involved in theoretical astrophysics [...]
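The "factor of 16" in the abstract above is consistent with simple lane arithmetic. The sketch below assumes 512-bit vector registers and 32-bit single-precision elements; the abstract does not name an instruction set, so that width is an assumption. The function names `lanes` and `saxpy` are illustrative only, and the `#pragma omp simd` hint is ignored by compilers built without OpenMP support, in which case the loop simply runs as ordinary scalar or auto-vectorized code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Number of elements that fit in one vector register.
// Assumption: the caller supplies the register width (e.g. 512 bits).
constexpr int lanes(int vectorBits, int elementBits) {
    return vectorBits / elementBits;
}

// y[i] += a * x[i]: each iteration is independent of the others,
// which is what makes the loop a candidate for vectorization.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    const float* xp = x.data();
    float* yp = y.data();
    const std::size_t n = x.size();
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i) {
        yp[i] += a * xp[i];
    }
}
```

Under these assumptions `lanes(512, 32)` gives 16 and `lanes(512, 64)` gives 8. That is an upper bound on the arithmetic speedup from vectorization alone, not a measured result.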

Get the Most out of Your Free Trial of Intel Xeon Phi Processors

April 7, 2017

Free Webinar. Abstract: Intel® Xeon Phi™ processors x200 (formerly Knights Landing) are computational beasts. Their theoretical peak performance is up to 3 TFLOP/s, and their measured memory bandwidth is up to 490 GB/s. This performance is available without any difference in programming models compared to general-purpose x86-like CPUs. Colfax is offering a free trial program for this technology, made available through Intel’s sponsorship. The Colfax Cluster has 64 compute nodes based on Intel Xeon Phi 7250 processors, interconnected with Intel® Omni-Path fabric. This cluster is at your service for two weeks for testing and evaluation. In this 1-hour webinar I will describe how you can get the most out of your two weeks on the cluster: what workloads you can run to see the performance; how to prepare your own code to run on the cluster; where to learn the best optimization practices for this and similar architectures. Slides: Colfax-Remote-Access-Webinar-2017.pdf (2 MB; this file is available only to registered users). Free trial: here [...]

MC² Series: Modern Code Contributed Talks

February 10, 2017

In Modern Code Contributed Talks, or MC² Series, experts in computational disciplines share their experience. Register for these ongoing webinars to learn the performance optimization methods used in real-life applications. Would you like to contribute a talk? Contact us. Scholarship is available in the form of access to a diverse collection of powerful computing [...]

Modern Code for Intel Xeon Phi Processors

December 8, 2016

This series of 45-minute webinars was presented by Colfax International in collaboration with Intel in 2016. 1. Strategies for Multi-Threading on Intel Xeon Phi Processors: practical recipes for optimizing performance in multi-threaded computational applications on Intel Xeon Phi processors. The presentation covers common issues with thread parallelism (excessive synchronization, false sharing, insufficient iteration space size) and methods for overcoming them (parallel reduction, data padding, strip-mining and loop collapse, and nested parallelism). Recording: 45 min; this webinar aired September 28, 2016. Slides: Colfax_Modern_Code_Webinar_01.pdf (5 MB; this file is available only to registered users). 2. Fine-Tuning Vectorization on Intel Xeon Phi Processors: vectorization of computational applications on Intel Xeon Phi processors. Covers automatic vectorization essentials and the toolkit for advanced tuning of vectorization performance, including compiler directives, data [...]

Clustering Modes in Knights Landing Processors

May 11, 2016

This publication is part of a developer guide focusing on the new features in 2nd generation Intel® Xeon Phi™ processors code-named Knights Landing (KNL). In this document we discuss the clustering modes of the on-die mesh interconnect. We start with a discussion of what types of applications benefit from the clustering modes and why clustering modes help these applications. After that we cover the specifics of the available cluster modes: all-to-all, quadrant, hemisphere, SNC-4 and SNC-2. Finally, we discuss how to make an application NUMA-aware for use in SNC modes. In this context, we give recipes for nested OpenMP and hybrid MPI+OpenMP approaches combined with the first-touch allocation policy, the numactl tool, and the memkind library. Colfax_KNL_Clustering_Modes_Guide.pdf (376 KB; this file is available only to registered users). See also: colfaxresearch.com/get-ready-for-intel-knights-landing-3-papers/ 1. Cache Organization in KNL: 2nd generation Intel® Xeon Phi™ processors code-named Knights Landing (KNL) are specialized [...]

Optimization Techniques for the Intel MIC Architecture. Part 3 of 3: False Sharing and Padding

August 8, 2015

This is part 3 of a 3-part educational series of publications introducing select topics on optimization of applications for Intel’s multi-core and manycore architectures (Intel Xeon processors and Intel Xeon Phi coprocessors). In this paper we discuss false sharing, highlighting the situations in which it may occur, and eliminating it with the help of data container padding. For a practical illustration, we construct and optimize a micro-kernel for binning particles based on their coordinates. Similar workloads occur in Monte Carlo simulations, particle physics software, and statistical analysis. Results show that the impact of false sharing may be as high as an order of magnitude performance loss in a parallel application. On Intel Xeon processors, the padding required to eliminate false sharing is greater than on Intel Xeon Phi coprocessors, so target-specific padding values may be used in real-life applications. See also: Part 1: Multi-Threading and Parallel Reduction; Part 2: Strip-Mining for Vectorization; Part 3: False Sharing and Padding. Complete paper: [...]
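As a rough illustration of the padding idea summarized above, the following sketch bins values from [0, 1) using one private histogram row per worker and separates the rows by a full cache line of unused integers, so that two workers never write to the same line. This is not the paper's code: the function name `binPadded`, the 64-byte line size, and the fixed pad are assumptions, and the excerpt itself notes that suitable padding is target-specific. The OpenMP pragma is optional; without OpenMP the loop over workers runs serially and gives the same counts.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption: 64-byte cache lines. Real targets may need a different value.
constexpr std::size_t kCacheLineBytes = 64;
constexpr std::size_t kPadInts = kCacheLineBytes / sizeof(int);

// Bins values from [0, 1) into nBins equal-width bins.
// Each worker fills its own row; rows are padded apart to avoid false sharing.
std::vector<int> binPadded(const std::vector<float>& x, int nBins, int nWorkers) {
    assert(nBins > 0 && nWorkers > 0);
    const std::size_t bins = static_cast<std::size_t>(nBins);
    const std::size_t workers = static_cast<std::size_t>(nWorkers);
    const std::size_t stride = bins + kPadInts;  // row length including padding
    std::vector<int> priv(stride * workers, 0);
    const std::size_t n = x.size();

    #pragma omp parallel for num_threads(nWorkers)
    for (int w = 0; w < nWorkers; ++w) {
        const std::size_t uw = static_cast<std::size_t>(w);
        int* row = priv.data() + stride * uw;    // this worker's private row
        const std::size_t begin = n * uw / workers;
        const std::size_t end = n * (uw + 1) / workers;
        for (std::size_t i = begin; i < end; ++i) {
            int b = static_cast<int>(x[i] * static_cast<float>(nBins));
            if (b < 0) b = 0;                    // clamp out-of-range input
            if (b >= nBins) b = nBins - 1;
            ++row[b];
        }
    }

    // Serial reduction of the private rows into the final histogram.
    std::vector<int> total(bins, 0);
    for (std::size_t w = 0; w < workers; ++w) {
        for (std::size_t b = 0; b < bins; ++b) {
            total[b] += priv[stride * w + b];
        }
    }
    return total;
}
```

Setting `kPadInts` to zero would give the unpadded layout, in which the end of one worker's row and the start of the next can fall on the same cache line.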

Are You Realizing the Payoff of Parallel Processing?

July 10, 2015

My contributed article has just been published at Intel Communities. …as Intel processor architectures evolve, you get performance boosts in some areas without doing anything with your code. For instance, such architectural improvements as bigger caches, instruction pipelining, smarter branch prediction, and prefetching improve performance of some applications without any changes in the code. However, parallelism is different. To realize the full potential of the capabilities of multiple cores and vectors, you have to make your application aware of parallelism. That is what code modernization is about: it is the process of adapting applications to new hardware capabilities, especially parallelism on multiple levels. … Once you have a robust version of code, you are basically future-ready. You shouldn’t have to make major modifications to take advantage of new generations of the Intel architecture. Just like in the past, when computing applications could “ride the wave” of increasing clock frequencies, your modernized code will be able to automatically take [...]

Optimization Techniques for the Intel MIC Architecture. Part 2 of 3: Strip-Mining for Vectorization

June 26, 2015

This is part 2 of a 3-part educational series of publications introducing select topics on optimization of applications for Intel’s multi-core and manycore architectures (Intel Xeon processors and Intel Xeon Phi coprocessors). In this paper we discuss data parallelism. Our focus is automatic vectorization and exposing vectorization opportunities to the compiler. For a practical illustration, we construct and optimize a micro-kernel for binning particles. Similar workloads occur in Monte Carlo simulations, particle physics software, and statistical analysis. The optimization technique discussed in this paper leads to code vectorization, which results in an order of magnitude performance improvement on an Intel Xeon processor. Performance on Xeon Phi compared to that on a high-end Xeon is 1.4x greater in single precision and 1.6x greater in double precision. See also: Part 1: Multi-Threading and Parallel Reduction; Part 2: Strip-Mining for Vectorization; Part 3: False Sharing and Padding. Complete paper: Colfax_Optimization_Techniques_2_of_3.pdf (650 KB) [...]
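To make the strip-mining idea concrete, here is a minimal sketch of a binning loop split into short strips. Inside each strip, a first inner loop only computes bin indices (independent arithmetic that a compiler can vectorize), and a second inner loop performs the scattered increments, which may collide and therefore stay scalar. This is an illustration under stated assumptions rather than the paper's kernel: the name `binStripMined`, the strip length of 16, and the input range [0, 1) are all hypothetical choices.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption: a strip of 16 elements; the best length depends on the target.
constexpr std::size_t kStrip = 16;

// Bins values from [0, 1) into nBins equal-width bins using a strip-mined loop.
std::vector<int> binStripMined(const std::vector<float>& x, int nBins) {
    assert(nBins > 0);
    std::vector<int> hist(static_cast<std::size_t>(nBins), 0);
    const std::size_t n = x.size();
    const float scale = static_cast<float>(nBins);
    int idx[kStrip];  // bin indices for the current strip

    for (std::size_t ii = 0; ii < n; ii += kStrip) {
        const std::size_t len = (n - ii < kStrip) ? (n - ii) : kStrip;

        // Vectorizable part: no iteration depends on another.
        for (std::size_t k = 0; k < len; ++k) {
            int b = static_cast<int>(x[ii + k] * scale);
            b = (b < 0) ? 0 : b;                  // clamp out-of-range input
            b = (b >= nBins) ? (nBins - 1) : b;
            idx[k] = b;
        }

        // Scalar part: two elements may hit the same bin, so the
        // increments are kept out of the vectorized loop.
        for (std::size_t k = 0; k < len; ++k) {
            ++hist[static_cast<std::size_t>(idx[k])];
        }
    }
    return hist;
}
```

Whether the first inner loop is actually vectorized depends on the compiler and flags; a vectorization report is the usual way to check.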

Scientific Computing with Intel Xeon Phi Coprocessors

February 4, 2015

I had the privilege of giving a presentation at the HPC Advisory Council Stanford Conference 2015. Thanks to insideHPC, a recording of this presentation is available on YouTube. Slides are available here and here: Colfax-HPCAC.pdf (this file is available only to registered users). If you are interested in individual case studies mentioned in the talk, here they are: Paper: 2013a, 2013b Papers: 2013, 2014 Paper: 2013 Paper: [...]