You are viewing archived content (2011-2018).


Second Edition of “Parallel Programming and Optimization with Intel Xeon Phi Coprocessors”

January 15, 2019

Our first book, “Parallel Programming and Optimization with Intel Xeon Phi Coprocessors” (second edition), is now available for free. Use the link below to download it.

Parallel_Programming_and_Optimization_with_Intel_Xeon_Phi_Coprocessors_2nd_Edition.pdf (14 MB)

This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License.

For the code of the supplementary practical exercises (“labs”), including updates for the Intel Xeon Phi processor x200 family, go to [...]

Colfax Research papers translated to Japanese

July 14, 2014

With the help of our partners at Intel, some of our articles on Intel Xeon Phi coprocessor programming were translated to the Japanese language.

インテル社の協力で、弊社のインテル(R) Xeon Phi(TM) コプロセッサーのプログラミングについての白書の一部が日本語に翻訳されました。

Original: Configuration and Benchmarks of Peer-to-Peer Communication over Gigabit Ethernet and InfiniBand in a Cluster with Intel Xeon Phi Coprocessors
Translation: JP-Colfax_InfiniBand_for_MIC.pdf (2 MB)

Original: Heterogeneous Clustering with Homogeneous Code: Accelerate MPI Applications Without Code Surgery Using Intel Xeon Phi Coprocessors
Translation: JP-Colfax_Heterogeneous_Clustering_Xeon_Phi.pdf (657 KB)

Original: Multithreaded Transposition of Square Matrices with Common Code for Intel Xeon Processors and Intel Xeon Phi Coprocessors
Translation: JP-Colfax_Transposition-7110P.pdf (987 KB)

Original: Test-driving Intel Xeon Phi coprocessors with a basic N-body simulation
Translation: JP-Colfax_Nbody_Xeon_Phi-with-addendum.pdf (2 [...]

Parallel Computing in the Search for New Physics at LHC

December 2, 2013

In the past few months, we have had the pleasure of collaborating with Prof. Valerie Halyo of Princeton University on the modernization of a high energy physics application for the needs of the Large Hadron Collider (LHC). The objective of our project is to improve the performance of the trigger at the LHC, so as to enable real-time detection of exotic collision event products, such as black holes or jets. The Hough transform was chosen as the numerical algorithm for the new trigger software. This method allows fast detection of straight or curved tracks in a set of points (detector hits), which could be the traces of new exotic particles. The numerical Hough transform is highly parallelizable; however, existing implementations either did not use hardware parallelism or used it sub-optimally. Colfax’s role in the project was to optimize a thread-parallel implementation of the Hough transform for multi-core processors. The result of our involvement was a code capable of detecting 5000 tracks in a synthetic dataset 250x faster than prior art on a multi-core desktop CPU. By [...]
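To make the idea of Hough-transform voting concrete, here is a minimal sketch in Python of the textbook version for straight lines. It is not the trigger code from this project, and the function names hough_lines and strongest_line, the bin sizes, and the sample points are all assumptions chosen for illustration. Each point (x, y) votes for every line rho = x cos(theta) + y sin(theta) that passes through it, and the bin with the most votes is taken as the strongest track candidate.

```python
import math
from collections import Counter


def hough_lines(points, n_theta=180, rho_step=1.0):
    """Accumulate votes in (theta, rho) space for a set of 2D points.

    Illustrative sketch only: a line is parametrized as
    rho = x*cos(theta) + y*sin(theta), theta is sampled at n_theta
    evenly spaced angles in [0, pi), and rho is binned in units of rho_step.
    Returns a Counter keyed by (theta_index, rho_index).
    """
    votes = Counter()
    for x, y in points:
        # Every point votes independently of the others, which is why
        # the method lends itself to parallel execution.
        for t in range(n_theta):
            theta = math.pi * t / n_theta
            rho = x * math.cos(theta) + y * math.sin(theta)
            votes[(t, round(rho / rho_step))] += 1
    return votes


def strongest_line(points, n_theta=180, rho_step=1.0):
    """Return (theta, rho, vote_count) for the most-voted bin."""
    votes = hough_lines(points, n_theta, rho_step)
    (t, r), count = votes.most_common(1)[0]
    return math.pi * t / n_theta, r * rho_step, count


if __name__ == "__main__":
    # Hypothetical data: ten hits on the line y = x plus three stray hits.
    hits = [(10 * i, 10 * i) for i in range(10)] + [(5, 40), (70, 12), (33, 81)]
    theta, rho, count = strongest_line(hits)
    print(f"theta = {math.degrees(theta):.1f} deg, rho = {rho:.1f}, votes = {count}")
```

The loop over points carries no dependency between iterations apart from the shared accumulator, so a thread-parallel version would mainly need to decide how to combine per-thread vote counts. Detecting curved tracks would require a different parametrization than the straight-line one assumed here.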

Avoiding communication saves time and energy (if you are an algorithm)

May 30, 2012

In this post, I would like to reflect on a seminar that I recently attended at Stanford University’s Institute for Computational and Mathematical Engineering. The talk was given by Prof. James Demmel, who leads the research on communication-avoiding algorithms at the UC Berkeley Computer Science department. I took home two lessons from this talk: first, the research in communication-avoiding algorithms has brought about amazing optimization possibilities, which reduce the time and energy usage of a number of computing problems; and second, the trend of hardware upgrades in the academic HPC arena goes in a direction that is counter-productive for these novel methods.

Why avoiding communication is important

It is common knowledge that the arithmetic capabilities of computing systems progress much faster than the bandwidth and latency of computer networks and random-access memory. An explanation of this trend offered by Mark Hoemmen, a student of Demmel, is that “Flops are cheap, bandwidth is money, latency is physics”. The consequence of the skyrocketing [...]