You are viewing archived content (2011-2018). For current research, visit research.colfax-intl.com

Case Studies

Large Fast Fourier Transforms with FFTW 3.3 on Terabyte-RAM NUMA Servers

February 2, 2012

This paper presents the results of a Fast Fourier Transform (FFT) benchmark of the FFTW 3.3 library on Colfax’s 4-CPU, large memory servers. Unlike other published benchmarks of this library, we study two distinct cases of FFT usage: sequential and concurrent computation of multithreaded transforms. In addition, this paper provides results for very large (up to N = 231) and massively parallel (up to 80 threads) shared memory transforms, which have not yet been reported elsewhere. The FFT calculation is discussed: parallelization techniques and hardware-specific implementations; motivation for a specific astrophysical research is given. Results presented here include: dependence of performance on the transform size and on the number of threads, memory usage of multithreaded 1D FFTs, estimates of the FFT planning time. The paper shows how to optimize the performance of concurrent independent calculations on these large memory systems by setting an efficient NUMA policy. This policy partitions the machine’s resources, reducing the average memory latency. Such optimization [...]

Terabyte RAM Servers: Memory Bandwidth Benchmark and How to Boost RAM Bandwidth by 20% with a Single Command

January 4, 2012

Colfax International produces servers capable of supporting up to 1 TB of RAM and up to 4 Intel Xeon CPUs. This paper reports the memory bandwidth benchmark of these servers obtained using the STREAM code. Our benchmark includes comprehensive statistical data: the mean, standard deviation, extrema and the distribution of bandwidth measurements. The distribution of measurements reveals several modes of RAM performance, including an above-average bandwidth mode. By default, the mode realized by any given benchmark depends on an unpredictable runtime pattern of thread and memory binding to the physical cores. The paper shows how to optimize memory traffic for bandwidth and consistently achieve the fastest mode. This is done by controlling the code’s thread affinity, and results in a bandwidth increase around 20% over the average unoptimized performance. Without optimization, the measured RAM bandwidth with one thread is 5.79±0.06 GB/s (the ‘copy’ test), and it scales almost linearly with the number of threads until it peaks at 67±6 GB/s at 20 threads. [...]
1 2 3