Welcome! You are viewing archived content of the Colfax Research project. For our current business page, see colfax-intl.com

Articles by Andrey

Are You Realizing the Payoff of Parallel Processing?

July 10, 2015

My contributed article has just been published at Intel Communities. …as Intel processor architectures evolve, you get performance boosts in some areas without doing anything with your code. For instance, such architectural improvements as bigger caches, instruction pipelining, smarter branch prediction, and prefetching improve performance of some applications without any changes in the code. However, parallelism is different. To realize the full potential of the capabilities of multiple cores and vectors, you have to make your application aware of parallelism. That is what code modernization is about: it is the process of adapting applications to new hardware capabilities, especially parallelism on multiple levels. … Once you have a robust version of code, you are basically future-ready. You shouldn’t have to make major modifications to take advantage of new generations of the Intel architecture. Just like in the past, when computing applications could “ride the wave” of increasing clock frequencies, your modernized code will be able to automatically take [...]

Optimization Techniques for the Intel MIC Architecture. Part 2 of 3: Strip-Mining for Vectorization

June 26, 2015

This is part 2 of a 3-part educational series of publications introducing select topics on optimization of applications for Intel’s multi-core and manycore architectures (Intel Xeon processors and Intel Xeon Phi coprocessors). In this paper we discuss data parallelism. Our focus is automatic vectorization and exposing vectorization opportunities to the compiler. For a practical illustration, we construct and optimize a micro-kernel for particle binning particles. Similar workloads occur applications in Monte Carlo simulations, particle physics software, and statistical analysis. The optimization technique discussed in this paper leads to code vectorization, which results in an order of magnitude performance improvement on an Intel Xeon processor. Performance on Xeon Phi compared to that on a high-end Xeon is 1.4x greater in single precision and 1.6x greater in double precision. See also: Part 1: Multi-Threading and Parallel Reduction Part 2: Strip-Mining for Vectorization Part 3: False Sharing and Padding Complete paper:  Colfax_Optimization_Techniques_2_of_3.pdf (650 KB) [...]

Scientific Computing with Intel Xeon Phi Coprocessors

February 4, 2015

I had the privilege of giving a presentation at the HPC Advisory Council Stanford Conference 2015. Thanks to insideHPC, a recording of this presentation is available on YouTube. Slides are available here and here:  Colfax-HPCAC.pdf () If you are interested in individual case studies mentioned in the talk, here they are: Paper: 2013a, 2013b Papers: 2013, 2014 Paper: 2013 Paper: [...]

Fluid Dynamics with Fortran on Intel Xeon Phi coprocessors

February 4, 2015

In this demonstration, a Colfax ProEdge™ SXP8400 workstation runs a shallow water flow solver, demonstrating CFD acceleration with Intel Xeon Phi coprocessors. The key feature of this demonstration is that exactly the same source code is used to compile the MPI executables for the Intel Xeon E5-2697 V3 processor and for Intel Xeon Phi 7120A coprocessors. The code is written in Fortran with OpenMP and MPI. For performance results with this code in a MIC-enabled cluster, see companion [...]

Performance to Power and Performance to Cost Ratios with Intel Xeon Phi Coprocessors (and why 1x Acceleration May Be Enough)

January 27, 2015

The paper studies two performance metrics of systems enabled with Intel Xeon Phi coprocessors: the ratio of performance to consumed electrical power and the ratio of performance to purchasing system cost, both under the assumption of linear parallel scalability of the application. Performance to power values are measured for three workloads: a compute-bound workload (DGEMM), a memory bandwidth-bound workload (STREAM), and a latency-limited workload (small matrix LU decomposition). Performance to cost ratios are computed, using system configurations and prices available at Colfax International, as functions of the acceleration factor and of the number of coprocessors per system. That study considers hypothetical applications with acceleration factor from 0.35x to 2x. In all studies, systems with Intel Xeon Phi coprocessors yield better metrics than systems with only Intel Xeon processors. That applies even with acceleration factor of 1x, as long as the application can be distributed between the CPU and the coprocessor. Complete paper:  Colfax_1x.pdf (321 [...]

Fine-Tuning Vectorization and Memory Traffic on Intel Xeon Phi Coprocessors: LU Decomposition of Small Matrices

January 27, 2015

Common techniques for fine-tuning the performance of automatically vectorized loops in applications for Intel Xeon Phi coprocessors are discussed. These techniques include strength reduction, regularizing the vectorization pattern, data alignment and aligned data hint, and pointer disambiguation. In addition, the loop tiling technique of memory traffic tuning is shown. The optimization methods are illustrated on an example of single-threaded LU decomposition of a single precision matrix of size 128×128. Benchmarks show that the discussed optimizations improve the performance on the coprocessor by a factor of 2.8 compared to the unoptimized code, and by a factor of 1.7 on the multi-core host system, achieving roughly the same performance on the host and on the coprocessor. The code discussed in the paper can be freely downloaded from this page. Complete paper:  Colfax_LU.pdf (604 KB) Source code for Linux: colfax-lu.tgz (17 [...]

Crash Course on Programming and Optimization with Intel Xeon Phi Coprocessors at SC14

November 16, 2014

Programming and optimization of applications for Intel Xeon Phi processors is going to be discussed in more than ten presentations in four concurrent track sessions at the Intel HPC Developer Conference at SC14 in New Orleans, LA on November 16, 2014. Colfax has contributed two of these presentations: one a crash course on the applicability domain and programming models for Intel Xeon Phi coprocessors, and another a demonstration of optimization of an N-body simulation for coprocessors on the node level and cluster level. Slides of our presentations can be downloaded from this page. Stay tuned for an upcoming Colfax Research paper with downloadable code for the example demonstrated in our slides. If you are attending SC14 in New Orleans, visit us at Colfax’s booth 1047 and also at the Intel Channel Pavilion. Part 1. Introduction, Programming Models:  Colfax-Intro.pdf (10 MB) Part 2. Optimization Techniques:  Colfax-Optimization.pdf (9 [...]

Installing Intel MPSS 3.3 in Arch Linux

August 20, 2014

This technical publication provides instructions for installing the Intel Manycore Platform Software Stack (MPSS) version 3.3 in Arch Linux operating system. Intel MPSS is a suite of tools necessary for operation of Intel Xeon Phi coprocessors. Instructions provided here enable offload and networking functionality for coprocessors in Arch Linux. The procedure described in this paper is completely reversible via an uninstallation script. Downloads: Product Direct Link Intel MPSS 3.3 (page, archive) mpss-3.3-linux.tar (~400 MB) Linux Kernel 3.10 LTS (AUR) linux-lts310.tar.gz (78 KB) TRee Installation Generator (TRIG) trig.sh (3 KB) RHEL networking utilities rhnet.tgz (34 KB) Offload functionality test Offload-Hello.cc (347 B) GNU Public License v2 (applies to TRIG and RHEL utilities) page Paper:  Colfax_MPSS_in_Arch_Linux.pdf (97 KB) Make sure to read important additional in the “Comments” below [...]

File I/O on Intel Xeon Phi Coprocessors: RAM disks, VirtIO, NFS and Lustre

July 28, 2014

The key innovation brought about by Intel Xeon Phi coprocessors is the possibility to port most HPC applications to manycore computing accelerators without code modification. One of the reasons why this is possible is support for file input/output (I/O) directly from applications running on coprocessors. These facilities allow seamless usage of manycore accelerators in common HPC tasks such as application initialization from file data, saving running output, checkpointing and restarting, data post-processing and visualization, and other. This paper provides information and benchmarks necessary to make the choice of the best file system for a given application from a number of the available options: RAM disks, virtualized local hard drives, and distributed storage shared with NFS or Lustre. We report benchmarks of I/O performance and parallel scalability on Intel Xeon Phi coprocessors, strengths and limitations of each option. In addition, the paper presents system administration procedures necessary for using each file system on coprocessors, including bridged networking and [...]

Cluster-Level Tuning of a Shallow Water Equation Solver on the Intel MIC Architecture

May 12, 2014

The paper demonstrates the optimization of the execution environment of a hybrid OpenMP+MPI computational fluid dynamics code (shallow water equation solver) on a cluster enabled with Intel Xeon Phi coprocessors. The discussion includes: Controlling the number and affinity of OpenMP threads to optimize access to memory bandwidth; Tuning the inter-operation of OpenMP and MPI to partition the problem for better data locality; Ordering the MPI ranks in a way that directs some of the traffic into faster communication channels; Using efficient peer-to-peer communication between Xeon Phi coprocessors based on the InfiniBand fabric. With tuning, the application has 90% percent efficiency of parallel scaling up to 8 Intel Xeon Phi coprocessors in 2 compute nodes. For larger problems, scalability is even better, because of the greater computation to communication ratio. However, problems of that size do not fit in the memory of one coprocessor. The performance of the solver on one Intel Xeon Phi coprocessor 7120P exceeds the performance on a dual-socket Intel Xeon E5-2697 v2 CPU by a [...]
1 2 3 4