Colfax International produces servers capable of supporting up to 1 TB of RAM and up to four Intel Xeon CPUs. This paper reports memory bandwidth benchmarks of these servers, obtained using the STREAM code.
Our benchmark includes comprehensive statistical data: the mean, standard deviation, extrema and distribution of the bandwidth measurements. The distribution reveals several modes of RAM performance, including an above-average bandwidth mode. By default, the mode realized in any given benchmark run depends on an unpredictable runtime pattern of thread and memory binding to the physical cores. The paper shows how to optimize memory traffic for bandwidth and consistently achieve the fastest mode. This is done by controlling the code’s thread affinity, and it increases bandwidth by around 20% over the average unoptimized performance.
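As an illustration of the kind of statistical summary described above, the following is a minimal Python sketch, not the authors' analysis code. The function name `summarize`, the 2 GB/s bin width and the sample values are all hypothetical and chosen only for demonstration. It computes the mean, sample standard deviation and extrema of a set of bandwidth readings, and bins them into a coarse histogram in which separate performance modes would show up as separate clusters.

```python
import statistics
from collections import Counter


def summarize(samples_gbs, bin_width=2.0):
    """Return mean, sample standard deviation, extrema and a histogram.

    samples_gbs: bandwidth readings in GB/s (one per benchmark run).
    bin_width:   histogram bin width in GB/s (an arbitrary choice here).
    """
    # Map each sample to the lower edge of the bin that contains it.
    bins = Counter(bin_width * (s // bin_width) for s in samples_gbs)
    return {
        "mean": statistics.mean(samples_gbs),
        "stdev": statistics.stdev(samples_gbs),
        "min": min(samples_gbs),
        "max": max(samples_gbs),
        "histogram": dict(sorted(bins.items())),
    }


# Hypothetical readings in GB/s: made-up values, NOT measured data.
samples = [60.5, 61.0, 60.8, 67.2, 66.9, 67.1, 78.6, 79.0, 78.8, 61.2]
summary = summarize(samples)

print(f"mean  = {summary['mean']:.2f} GB/s")
print(f"stdev = {summary['stdev']:.2f} GB/s")
print(f"range = {summary['min']:.1f} .. {summary['max']:.1f} GB/s")
for edge, count in summary["histogram"].items():
    # One '#' per run that fell into this bin.
    print(f"{edge:5.1f}-{edge + 2.0:5.1f} GB/s: {'#' * count}")
```

With multimodal data such as this, the mean and standard deviation alone hide the structure; the histogram makes the individual modes visible.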
Without optimization, the measured RAM bandwidth with one thread is 5.79±0.06 GB/s (the ‘copy’ test), and it scales almost linearly with the number of threads until it peaks at 67±6 GB/s at 20 threads. Optimized code reaches a maximum bandwidth of 78.9±0.3 GB/s. A list of references on NUMA architecture tools is provided.
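The quoted figures can be cross-checked with a short calculation. The sketch below assumes that the unoptimized peak of 67±6 GB/s is the baseline for the "around 20%" improvement; the text does not state this explicitly, so it is an assumption. The helper name `relative_gain` is hypothetical, and the bounds are a crude estimate that ignores the smaller ±0.3 GB/s uncertainty on the optimized figure.

```python
def relative_gain(optimized, baseline):
    """Fractional increase of `optimized` over `baseline`."""
    return (optimized - baseline) / baseline


# Figures quoted in the text, in GB/s.
unoptimized_peak = 67.0   # unoptimized peak at 20 threads
unoptimized_err = 6.0     # quoted uncertainty on that peak
optimized_peak = 78.9     # optimized maximum

# Central estimate, assuming the unoptimized peak is the baseline.
central = relative_gain(optimized_peak, unoptimized_peak)

# Crude bounds from shifting the baseline by its quoted uncertainty.
low = relative_gain(optimized_peak, unoptimized_peak + unoptimized_err)
high = relative_gain(optimized_peak, unoptimized_peak - unoptimized_err)

print(f"gain: {central:.1%} (range {low:.1%} to {high:.1%})")
# prints: gain: 17.8% (range 8.1% to 29.3%)
```

Under these assumptions the central estimate is close to the quoted figure of around 20%, which falls well within the range implied by the spread of the unoptimized measurements.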