Presentations

Modern Code for Intel Xeon Phi Processors

December 8, 2016

This series of 45-minute webinars was presented by Colfax International in collaboration with Intel in 2016.

► Part 1 | ► Part 2 | ► Part 3

1. Strategies for Multi-Threading on Intel Xeon Phi Processors

Practical recipes for optimizing performance in multi-threaded computational applications on Intel Xeon Phi processors. Presentation covers common issues with thread parallelism: excessive synchronization, false sharing, insufficient iteration space size, and methods for overcoming these issues: parallel reduction, data padding, strip-mining and loop collapse, and nested parallelism.

► Click to watch recording (45 min) – this webinar aired September 28, 2016

Slides: Colfax_Modern_Code_Webinar_01.pdf (5 MB)

2. Fine-Tuning Vectorization on Intel Xeon Phi Processors

Vectorization of computational applications on Intel Xeon Phi processors. Covers automatic vectorization essentials and the toolkit for advanced tuning of vectorization performance, including compiler directives, data container optimization, and language extensions for expression of data [...]
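To make two of the remedies named in the multi-threading webinar concrete, here is a minimal sketch of parallel reduction (against excessive synchronization) and data padding (against false sharing). It is not taken from the webinar slides: the function names `sum_with_reduction` and `count_positive`, the struct `PaddedCounter`, and the 64-byte cache line size are assumptions made for illustration. The OpenMP pragmas take effect only when the code is compiled with OpenMP enabled (for example, `-fopenmp` or `-qopenmp`); otherwise the same code runs serially and gives the same results.

```cpp
#include <cassert>
#include <vector>

// Parallel reduction: each thread accumulates a private partial sum, and
// OpenMP combines the partial sums once at the end of the loop. This avoids
// synchronizing on a shared accumulator in every iteration.
double sum_with_reduction(const std::vector<double>& data) {
    double total = 0.0;
    const long n = static_cast<long>(data.size());
#pragma omp parallel for reduction(+ : total)
    for (long i = 0; i < n; ++i) {
        total += data[i];
    }
    return total;
}

// Data padding: each counter occupies a full 64-byte slot (assumed cache
// line size), so counters updated by different threads never share a line.
struct alignas(64) PaddedCounter {
    long value = 0;
};

// Counts positive elements. The index range is split into num_threads
// contiguous chunks, and chunk t updates only counters[t].
long count_positive(const std::vector<double>& data, int num_threads) {
    assert(num_threads > 0);
    std::vector<PaddedCounter> counters(num_threads);
    const long n = static_cast<long>(data.size());
#pragma omp parallel for num_threads(num_threads)
    for (int t = 0; t < num_threads; ++t) {
        const long begin = n * t / num_threads;
        const long end = n * (t + 1) / num_threads;
        for (long i = begin; i < end; ++i) {
            if (data[i] > 0.0) {
                ++counters[t].value;
            }
        }
    }
    // Serial combination of the per-chunk results.
    long total = 0;
    for (const PaddedCounter& c : counters) {
        total += c.value;
    }
    return total;
}
```

In practice a thread-private local variable or a `reduction` clause would usually replace the counter array altogether; the array is kept here only to show what padding changes about the memory layout.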

Optimizing Torch Performance for Intel Xeon Phi Processors

November 18, 2016

In this 1-hour webinar, Ryo Asai (Colfax) discusses how machine learning applications can benefit from code modernization. He begins by exploring the parallelism that gives modern computer architecture its performance, and how it can be leveraged. Then he applies code modernization techniques live on-screen to the Torch machine learning framework. Specifically, he optimizes image recognition through a deep convolutional neural network that uses the VGG-net architecture. For each code modernization technique, he explains why it works, and how to apply it in practice.

What you will learn:
- What code modernization is, and its importance for machine learning
- Practical knowledge of modern computer architectures
- Code modernization techniques for leveraging parallelism

Slides: Colfax-Torch-VGG-Webinar.pdf (2 [...]

Knights Landing Webinar Slides Translated to Japanese

May 13, 2016

With the help of our partners at XLsoft, the slide deck for the webinar “Introduction to Next-Generation Intel® Xeon Phi™ Processor: Developer’s Guide to Knights Landing” has been translated into Japanese.

XLsoft website

Download here: JP-Colfax-Programmers-Guide-to-KNL.pdf (5 MB)

For more information, and to register for the webinar, please visit: Webinar [...]

Slide Deck for Colfax Developer Training on Parallel Programming

February 26, 2016

We are making publicly available the slide deck of the Colfax developer training titled “Parallel Programming and Optimization with Intel Architecture”. This training is an intensive course for developers wishing to leverage the Intel architecture. It is also useful for many-core and multi-core processor programming. The course is based on a book of the same name, which contains targeted exercises (“labs”) for hands-on practicum. In 2014-2015, “Parallel Programming and Optimization…” visited over 100 locations across the United States: research institutions, government labs, universities, and regional trainings. Over 2000 students attended the course. Many of these events were free to attendees thanks to Intel’s sponsorship.

Update: now with new information about the upcoming 2nd generation Intel Xeon Phi processor (Knights Landing, KNL).

Slide deck: Colfax-Developer-Training.pdf (13 MB) (last updated August [...]

Scientific Computing with Intel Xeon Phi Coprocessors

February 4, 2015

I had the privilege of giving a presentation at the HPC Advisory Council Stanford Conference 2015. Thanks to insideHPC, a recording of this presentation is available on YouTube.

Slides are available here and here: Colfax-HPCAC.pdf

If you are interested in individual case studies mentioned in the talk, here they are: Paper: 2013a, 2013b Papers: 2013, 2014 Paper: 2013 Paper: [...]

Crash Course on Programming and Optimization with Intel Xeon Phi Coprocessors at SC14

November 16, 2014

Programming and optimization of applications for Intel Xeon Phi processors is going to be discussed in more than ten presentations in four concurrent track sessions at the Intel HPC Developer Conference at SC14 in New Orleans, LA on November 16, 2014. Colfax has contributed two of these presentations: one a crash course on the applicability domain and programming models for Intel Xeon Phi coprocessors, and another a demonstration of optimization of an N-body simulation for coprocessors on the node level and cluster level. Slides of our presentations can be downloaded from this page. Stay tuned for an upcoming Colfax Research paper with downloadable code for the example demonstrated in our slides. If you are attending SC14 in New Orleans, visit us at Colfax’s booth 1047 and also at the Intel Channel Pavilion.

Part 1. Introduction, Programming Models: Colfax-Intro.pdf (10 MB)

Part 2. Optimization Techniques: Colfax-Optimization.pdf (9 [...]

Primer on Computing with Intel Xeon Phi Coprocessors

March 6, 2014

Geant4 is a high energy physics application package for simulation of elementary particle transport through matter. It is used in fundamental physics experiments, as well as in industrial and medical applications. For example, the ATLAS detector at LHC and the Fermi Gamma-Ray Space Telescope rely on Geant4 simulations, DNA damage due to ionizing radiation is studied by a derivative project Geant4-DNA, and radiotherapy planning can benefit from calculations with Geant4. Geant4 has long been employing distributed-memory parallelism in the MPI framework. However, due to the trend of increasing ratio of core count to memory size in modern computing systems, and due to the need to process larger geometry models, Geant4 is undergoing modernization through inclusion of thread parallelism in shared memory. This effort is led by SLAC researchers Dr. Makoto Asai and Dr. Andrea Dotti (see, e.g., slides 1 and slides 2). A beneficial by-product of such modernization is the possibility to use the Intel Many Integrated Core (MIC) architecture of Intel Xeon Phi coprocessors for Geant4 [...]

Accelerating Public Domain Applications: Lessons from Models of Radiation Transport in the Milky Way Galaxy

November 25, 2013

Last week I had the privilege of giving a talk at the Intel Theater at SC’13. I presented a case study done with Stanford University on using Intel Xeon Phi coprocessors for accelerating a new astrophysical library HEATCODE (HEterogeneous Architecture library for sTochastic COsmic Dust Emissivity). If this talk could be summarized in one sentence, it would be “One high performance code for two platforms is reality”. Indeed, the optimizations made to HEATCODE for the MIC architecture led to a tremendous performance increase on the CPU platform. As a consequence, we have developed a high performance library which can be employed and modified both by users who have access to Xeon Phi coprocessors, and by those only using multi-core CPUs. The paper introducing the HEATCODE library with details of the optimization process is under review at Computer Physics Communications. The preliminary manuscript can be obtained from arXiv, and the slides of the talk are available on this page (see links above and below). The open source code will be made available [...]

Accelerated Simulations of Cosmic Dust Heating Using the Intel Many Integrated Core Architecture

June 7, 2013

Cosmic dust absorbs starlight in the optical and ultraviolet ranges, and re-emits it in the infrared range. This process is crucial for radiative transport in our Galaxy. I am participating in a project to develop a computational tool for Galactic radiative transport simulation with stochastic light absorption and re-emission on small dust grains. This project has resulted in the development of a library called HEATCODE (HEterogeneous Architecture library for sTochastic COsmic Dust Emissivity) for fast calculation of the stochastic dust heating process using Intel Xeon Phi coprocessors. I presented HEATCODE and shared my experiences with the development and optimization of applications for Xeon Phi coprocessors in a talk at the Applied Mathematics and Statistics Department at UCSC. The slides from this talk can be downloaded here (see below). The full source code of the application, along with a detailed description of the optimization process, will soon be submitted for peer-reviewed publication, and will become publicly available. Slides from the talk: [...]

Avoiding communication saves time and energy (if you are an algorithm)

May 30, 2012

In this post, I would like to reflect on a seminar that I recently attended at Stanford University’s Institute for Computational and Mathematical Engineering. The talk was given by Prof. James Demmel, who leads the research on communication-avoiding algorithms at the UC Berkeley Computer Science department. I took home two lessons from this talk: first, the research in communication-avoiding algorithms has brought about amazing optimization possibilities, which reduce the time and energy usage of a number of computing problems; and second, the trend of hardware upgrades in the academic HPC arena is heading in a direction that is counterproductive for these novel methods.

Why avoiding communication is important

It is common knowledge that the arithmetic capabilities of computing systems progress much faster than the bandwidth and latency of computer networks and random-access memory. An explanation of this trend offered by Mark Hoemmen, a student of Demmel, is that “Flops are cheap, bandwidth is money, latency is physics”. The consequence of the skyrocketing [...]