Optimization of Hamerly’s K-Means Clustering Algorithm: CFXKMeans Library

July 21, 2017

This publication describes the application of performance optimizations techniques to Hamerly’s K-means clustering algorithm. Starting with an unoptimized implementation of the algorithm, we discuss: Thread scheduling Reduction patterns SIMD reduction Unroll and jam Presented optimizations aggregate to 85.6x speedup compared to the original unoptimized implementation. Resulting implementation is packaged into a library named CFXKMeans with interfaces for C/C++ and Python. The Python interface is benchmarked using the MNIST 784 data set. The result for K=64 is compared to the performance of K-means clustering implementation in a popular machine learning framework, scikit-learn, from the Intel distribution for Python. CFXKMeans performed our benchmark tests faster than scikit-learn by a factor of 4.68x on an Intel Xeon processor E5-2699 v4 and 5.54x on an Intel Xeon Phi 7250 processor. The CFXKMeans library has C/C++ and Python API and is available under the MIT license at Printable PDF: [...]

HPLinpack Benchmark on Intel Xeon Phi Processor Family x200 with Intel Omni-Path Fabric 100

July 10, 2017

We report the performance and a simplified tuning methodology of the HPLinpack benchmark on a cluster of Intel Xeon Phi processors 7250 with an Intel Omni-Path Fabric 100 Series interconnect. Our benchmarks are taken on the Colfax Cluster, a state-of-the-art computing resource open to the public for benchmarking and code validation. The paper provides recipes that may be used to reproduce our results in environments similar to this cluster. Printable PDF:  Colfax-HPL-Intel-Xeon-Phi-x200-and-Intel-Omni-Path-100.pdf (223 KB) — this file is available only to registered users. Register or Log In. Section 1. HPLinpack Benchmark The HPLinpack benchmark generates and solves on distributed-memory computers a large dense system of linear algebraic equations with random coefficients. The benchmark exercises the floating-point arithmetic units, the memory subsystem, and the communication fabric. The result of the HPLinpack benchmark is based on the time required to solve the system. It expresses the performance of that system in floating-point operations per second (FLOP/s). To [...]