Recent

Canonical Stratification for Non-Mathematicians

October 17, 2018

Our recent publication “Algorithmic Canonical Stratifications of Simplicial Complexes” proposes a new algorithm for data analysis that offers a topology-aware path towards explainable artificial intelligence. Despite (or, perhaps, due to) being mathematically rigorous, the text of the original work is virtually impenetrable for readers not familiar with the concepts, tools, and notation of topology. In order to convey our ideas to a wider audience, we present this supplemental introduction. Here, we summarize and explain in plain English the motivation, reasoning, and methods of our new topological data analysis algorithm that we term “canonical stratification”. Canonical-Stratification-for-Non-Mathematicians.pdf (38 KB) Table of Contents 1. Motivation 2. More on Canonical Stratification 3. Conclusion 1. Motivation Machine learning has advanced significantly in recent years and has proven itself to be a powerful and versatile tool in a variety of data-driven disciplines. Machine learning algorithms are now being used to make decisions in numerous areas [...]

How New QLC SATA SSDs Deliver 8x Faster Machine Learning

October 9, 2018

We record performance measurements on Micron 5210 SSD related to Machine Learning workflow. Even though Machine Learning is highly CPU intensive, fast storage can lower training time through faster file pre-processing and serialization, particularly when the size of a data set exceeds the amount of installed memory. A popular format for datasets is TFRecord, and in our performance measurements, we will be comparing the throughput speed and completion time of a TFRecord on a 7.68TB Micron 5210 ION SSD versus that of an 8TB Seagate 7200RPM HDD. Colfax-Machine-Learning-and-QLC-SSDs.pdf (151 KB) Table of Contents 1. QLC SSDs 2. Micron QLC SSD 3. Test System Configuration 4. Test Workload: TFRecord 5. Test Results 6. Summary 1. QLC SSDs For years, 7200 RPM hard disk drives (HDDs) have been the standard media on which Machine Learning (ML) training data sets have been stored. These traditional HDDs have been preferred due to their low cost and easy to adopt SATA interfaces. However, HDD’s suffer from relatively slow throughput . Solid State Drives (SSDs) have been too [...]

Best Practices for Speed in Deep Learning Applications on Intel Architecture

July 3, 2018

You have set up a deep learning model that you are planning to train on an Intel architecture processor. In order to be productive, you have to minimize the training time. You run the application and see that it takes N seconds for a single training epoch. How do you know if it is good? If improvement is possible, what can you do to improve the training time? Are there tools to identify a tuning strategy? Intel software development tools can answer these questions to maximize your productivity in deep learning on Intel architecture. At the Intel AI DevCon 2018 in San Francisco, Alaa Eltablawy (Colfax) presented a workshop that demonstrates how this works. For the workshop, attendees received access to the Intel® AI DevCloud, where they could experiment with the optimization of a TensorFlow-based application for image segmentation. The instructor demonstrated the performance analysis results obtained with Intel® VTune Amplifier and Application Performance Snapshot and explained how this analysis consistently guides you to the use of known “performance tuning knobs” in [...]

An optimization approach for agent-based computational models of biological development

April 9, 2018

Pablo Gonzalez-de-Aledoa, Andrey Vladimirovd, Marco Mancab, Jerry Baughc, Ryo Asaid, Marcus Kaisere,f, Roman Bauerf,e a Software Performance Optimization Group, Imperial College London, London, United Kingdom b CERN Openlab, IT Department, CERN, Switzerland c Intel Corporation, USA d Colfax International, USA e Interdisciplinary Computing and Complex BioSystems Research Group, School of Computing, Newcastle University, Newcastle upon Tyne, United Kingdom f Institute of Neuroscience, Newcastle University, Newcastle upon Tyne, United Kingdom A paper led by Pablo Gonzales-de-Aledo (Imperial College London) with contributions from his colleagues from CERN, Intel, Colfax and Newcastle University was published in the journal Advances in Engineering Software. This is a case study on performance optimization in a biological simulation code. The code presents a highly parallel implementation of a computer simulation that involves millions of agents interacting in a 3D environment. The paper explains the general approach to transforming a sequential code to run on modern, highly [...]

Optimization of Real-Time Object Detection on Intel® Xeon® Scalable Processors

November 11, 2017

This publication demonstrates the process of optimizing an object detection inference workload on an Intel® Xeon® Scalable processor using TensorFlow. This project pursues two objectives: Achieve object detection with real-time throughput (frame rate) and low latency Minimize the required computational resources In this case study, a model described in the “You Only Look Once” (YOLO) project is used for object detection. The model consists of two components: a convolutional neural network and a post-processing pipeline. In this work, the original Darknet model is converted to a TensorFlow model. First, the convolutional neural network is optimized for inference. Then array programming with NumPy and TensorFlow is implemented for the post-processing pipeline. Finally, environment variables and other configuration parameters are tuned to maximize the computational performance. With these optimizations, real-time object detection is achieved while using a fraction of the available processing power of an Intel Xeon Scalable processor-based system. [...]

A Performance-Based Comparison of C/C++ Compilers

November 11, 2017

This paper reports a performance-based comparison of six state-of-the-art C/C++ compilers: AOCC, Clang, G++, Intel C++ compiler, PGC++, and Zapcc. We measure two aspects of the compilers’ performance: The speed of compiled C/C++ code parallelized with OpenMP 4.x directives for multi-threading and vectorization. The compilation time for large projects with heavy C++ templating. In addition to measuring the performance, we interpret the results by examining the assembly instructions produced by each compiler. The tests are performed on an Intel Xeon Platinum processor featuring the Skylake architecture with AVX-512 vector instructions. Colfax_Compiler_Comparison.pdf (562 KB) Table of Contents 1. The Importance of a Good Compiler 2. Testing Methodology 2.1. Meet the Compilers 2.2. Target Architecture 2.3. Computational Kernels 2.4. Compilation Time 2.5. Test Details 2.6. Test Platform 2.7. Code Analysis 3. Results 3.1. Performance of Compiled Code 3.2. Compilation Speed 4. Summary Appendix A. LU Decomposition Appendix B. Jacobi Solver Appendix C. Structure Function Appendix [...]