You are viewing archived content (2011-2018). For current research, visit

Search Results for cheap

Optimization of Hamerly’s K-Means Clustering Algorithm: CFXKMeans Library

July 21, 2017

This publication describes the application of performance optimizations techniques to Hamerly’s K-means clustering algorithm. Starting with an unoptimized implementation of the algorithm, we discuss: Thread scheduling Reduction patterns SIMD reduction Unroll and jam Presented optimizations aggregate to 85.6x speedup compared to the original unoptimized implementation. Resulting implementation is packaged into a library named CFXKMeans with interfaces for C/C++ and Python. The Python interface is benchmarked using the MNIST 784 data set. The result for K=64 is compared to the performance of K-means clustering implementation in a popular machine learning framework, scikit-learn, from the Intel distribution for Python. CFXKMeans performed our benchmark tests faster than scikit-learn by a factor of 4.68x on an Intel Xeon processor E5-2699 v4 and 5.54x on an Intel Xeon Phi 7250 processor. The CFXKMeans library has C/C++ and Python API and is available under the MIT license at  Colfax-Kmeans-Clustering-Optimization.pdf (365 KB) [...]

Avoiding communication saves time and energy (if you are an algorithm)

May 30, 2012

In this post, I would like to reflect on a seminar that I recently attended at Stanford University’s Institute for Computational and Mathematical Engineering. The talk was given by Prof. James Demmel, who leads the research on communication avoiding algorithms at the UC Berkley Computer Science department. The lessons I took home from this talk are two: first, the research in communication avoiding algorithms has brought about amazing optimization possibilities, which reduce the time and energy usage of a number of computing problems; and second, the trend of hardware upgrades in the academic HPC arena goes in the direction that is counter-productive for these novel methods. Why avoiding communication is important It is common knowledge that arithmetic capabilities of computing systems progress much faster than the bandwidth and latency of computer networks and random-access memory. An explanation of this trend offered by Mark Hoemmen, a student of Demmel, is that “Flops are cheap, bandwidth is money, latency is physics“. The consequence of the skyrocketing [...]