You are viewing archived content (2011-2018).

Articles by Ryo

Colfax visits IPCC at Hartree Centre

June 25, 2015

In the second week of our visit to Intel Parallel Computing Centres (IPCC) in the U.K., we visited the IPCC at Hartree Centre. We conducted our 1-Day CDT on Monday and spent the remaining 4 days investigating scientific applications for optimization opportunities. We investigated two applications: a molecular dynamics simulation called DL_MESO and a weather simulation called Harmonie. With the help of Intel tools such as Intel VTune Amplifier, we profiled and analysed these applications for possible optimizations. For the DL_MESO application, we had the good fortune of working with Michael Seaton, the primary author of DL_MESO, and together we took the Lattice-Boltzmann part of the DL_MESO application and sped it up by 45.5% on a server node based on an Intel Xeon CPU. We plan to continue collaborating with the Hartree team on DL_MESO and further optimize [...]

Working with the Edinburgh Parallel Computing Center

June 17, 2015

We spent 5 days visiting the Edinburgh Parallel Computing Center during the week of June 8, 2015. Our job was to assist this organization, selected as an IPCC in 2014, with the optimization of the molecular dynamics code CP2K. Our host, Adrian Jackson, was kind enough to report on our collaboration in his blog: day 1, day 2, day 3, day 4 and day 5. This week we are continuing our trip in the U.K., collaborating with another IPCC at the Hartree [...]

Optimization Techniques for the Intel MIC Architecture. Part 1 of 3: Multi-Threading and Parallel Reduction

May 29, 2015

This is part 1 of a 3-part educational series of publications introducing select topics on optimization of applications for the Intel multi-core and manycore architectures (Intel Xeon processors and Intel Xeon Phi coprocessors). In this paper we focus on thread parallelism and race conditions. We discuss the usage of mutexes in OpenMP to resolve race conditions. We also show how to implement efficient parallel reduction using thread-private storage and mutexes. For a practical illustration, we construct and optimize a micro-kernel for binning particles based on their coordinates. Workloads like this one occur in such applications as Monte Carlo simulations, particle physics software, and statistical analysis. The optimization technique discussed in this paper leads to a performance increase of 25x on a 24-core CPU and up to 100x on the MIC architecture compared to a single-threaded implementation on the same architectures. In the next publication of this series, we will demonstrate further optimization of this workload, focusing on vectorization. See also: Part 1: Multi-Threading [...]
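As a rough illustration of the reduction pattern described in this abstract, the following is a minimal sketch and not the code from the paper. It assumes particle coordinates in [0, 1) and equal-width bins; the names `binIndex` and `binParticles` are hypothetical. Each thread fills a private histogram, and an OpenMP `critical` section (which behaves like a mutex) guards the merge into the shared result. Built without OpenMP support (for example, without `-fopenmp`), the pragmas are ignored and the function runs serially with the same result.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: map a coordinate to one of nBins equal-width bins.
// Assumes x is nominally in [0, 1); out-of-range values are clamped.
static inline std::size_t binIndex(float x, std::size_t nBins) {
    if (x <= 0.0f) return 0;
    const std::size_t i = static_cast<std::size_t>(x * static_cast<float>(nBins));
    return i < nBins ? i : nBins - 1;
}

// Count how many particles fall into each bin.
std::vector<long> binParticles(const std::vector<float>& x, std::size_t nBins) {
    assert(nBins > 0);
    std::vector<long> global(nBins, 0);
    const long n = static_cast<long>(x.size());

#pragma omp parallel
    {
        // Thread-private storage: no other thread touches this histogram,
        // so the hot loop needs no synchronization.
        std::vector<long> local(nBins, 0);

#pragma omp for
        for (long p = 0; p < n; ++p) {
            ++local[binIndex(x[p], nBins)];
        }

        // Reduction step: only one thread at a time may update the shared
        // histogram, which avoids the race condition.
#pragma omp critical
        for (std::size_t b = 0; b < nBins; ++b) {
            global[b] += local[b];
        }
    }
    return global;
}
```

The design point is that the lock is taken once per thread rather than once per particle, so contention stays small even with many threads.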

Intel Cilk Plus for Complex Parallel Algorithms: “Enormous Fast Fourier Transforms” (EFFT) Library

September 18, 2014

In this paper we demonstrate the methodology for parallelizing the computation of large one-dimensional discrete fast Fourier transforms (DFFTs) on multi-core Intel Xeon processors. DFFTs based on the recursive Cooley-Tukey method have to control cache utilization, memory bandwidth and vector hardware usage, and at the same time scale across multiple threads or compute nodes. Our method builds on the single-threaded Intel Math Kernel Library (MKL) implementation of DFFT, and uses the Intel Cilk Plus framework for thread parallelism. We demonstrate the ability of Intel Cilk Plus to handle parallel recursion with nested loop-centric parallelism without tuning the code to the number of cores or cache metrics. The result of our work is a library called EFFT that performs 1D DFFTs of size 2^N for N>=21 faster than the corresponding Intel MKL parallel DFFT implementation by up to 1.5x, and faster than FFTW by up to 2.5x. The code of EFFT is available for free download under the GPLv3 license. This work provides a new efficient DFFT implementation, and at the same time demonstrates an [...]
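To make the recursive structure mentioned in this abstract concrete, here is a minimal serial sketch of a radix-2 Cooley-Tukey transform. It is not the EFFT implementation, which builds on Intel MKL and Intel Cilk Plus; the function name `fftRecursive` is hypothetical, the input length is assumed to be a power of two, and the comments only mark where a task-parallel framework could run independent work concurrently.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Recursive radix-2 decimation-in-time transform.
// Assumption: a.size() is a power of two.
std::vector<cplx> fftRecursive(const std::vector<cplx>& a) {
    const std::size_t n = a.size();
    if (n <= 1) return a;

    // Split the input into even- and odd-indexed halves.
    std::vector<cplx> even(n / 2), odd(n / 2);
    for (std::size_t i = 0; i < n / 2; ++i) {
        even[i] = a[2 * i];
        odd[i] = a[2 * i + 1];
    }

    // The two half-size transforms are independent of each other, so a
    // task-parallel runtime could execute them concurrently (parallel recursion).
    const std::vector<cplx> e = fftRecursive(even);
    const std::vector<cplx> o = fftRecursive(odd);

    // Butterfly stage: iterations are independent, so this loop is a
    // candidate for loop-level parallelism nested inside the recursion.
    std::vector<cplx> out(n);
    const double pi = std::acos(-1.0);
    for (std::size_t k = 0; k < n / 2; ++k) {
        const double angle = -2.0 * pi * static_cast<double>(k) / static_cast<double>(n);
        const cplx t = std::polar(1.0, angle) * o[k];
        out[k] = e[k] + t;
        out[k + n / 2] = e[k] - t;
    }
    return out;
}
```

This sketch allocates temporaries at every level and ignores cache behaviour, which is exactly the kind of cost a production library has to control for large transform sizes.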