Arithmetics on Intel’s Sandy Bridge and Westmere CPUs: not all FLOPs are created equal


This paper presents a new arithmetic efficiency benchmark and uses it to compare the performance of the Intel Sandy Bridge E5-2680 CPU with that of the Intel Westmere X5690 CPU. The efficiency is measured for single and double precision floating point operations (addition, multiplication, division, square root and the exponential function) and for 32- and 64-bit integer operations (addition, multiplication and division). The benchmark covers the SSE2 and AVX instruction sets, as well as scalar operations, in single-threaded and multi-threaded modes. It eliminates the effects of memory bandwidth and latency by fitting the calculation in the L1 cache. The bandwidths of the L1 cache and of main memory (RAM) are estimated for reference, and the LINPACK benchmark result is reported.

Results show that the E5-2680 CPU performs floating point addition and multiplication dramatically faster (up to 2.6x) than the X5690 model. However, floating point division and square root are the new model’s weak spots. AVX floating point addition and multiplication are up to 2.0x faster than their SSE2 counterparts; however, AVX provides no performance gain for division and square root. 32-bit integer arithmetic operations, despite the lack of AVX integer intrinsics, are up to 3.5x faster on the E5-2680. At the same time, the Sandy Bridge CPU showed 1.15x better L1 cache performance and 2.4x greater memory bandwidth than the Westmere model.

These results lead to the conclusion that the edge of the 8-core, 2.70 GHz Sandy Bridge CPU over the 6-core, 3.46 GHz Westmere processor will be most significant, in both single and double precision, for linear algebra and other tasks based on addition and multiplication. Re-compiling codes that perform addition and multiplication-based tasks with AVX intrinsics instead of SSE2 should lead to additional performance benefits on Sandy Bridge. However, CPU-bound calculations that rely heavily on division and transcendental functions are likely to see a smaller speedup from using the Sandy Bridge processor in place of Westmere. Likewise, they will benefit less from the migration from SSE2 to AVX.
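
As a rough cross-check of the addition and multiplication advantage, the nominal peak rates of the two CPUs can be estimated from the core counts and clock frequencies quoted above. The sketch below is not taken from the paper. The function name is hypothetical, and it assumes that each core issues one vector addition and one vector multiplication per cycle, 256 bits wide with AVX on Sandy Bridge and 128 bits wide with SSE2 on Westmere, at the nominal clock with no Turbo Boost. Measured results depend on the instruction mix and the actual clock, so they need not match this estimate.

```python
def nominal_peak_gflops(cores, clock_ghz, simd_bits, element_bits, ops_per_cycle=2):
    """Nominal peak GFLOP/s: cores * GHz * SIMD lanes * vector ops per cycle.

    ops_per_cycle=2 is an assumption: one vector addition plus one vector
    multiplication issued per core per cycle.
    """
    lanes = simd_bits // element_bits
    return cores * clock_ghz * lanes * ops_per_cycle


# Core counts and clocks come from the text; the SIMD widths are assumptions.
sandy_bridge = {"cores": 8, "clock_ghz": 2.70, "simd_bits": 256}  # E5-2680, AVX
westmere = {"cores": 6, "clock_ghz": 3.46, "simd_bits": 128}      # X5690, SSE2

for name, bits in (("single", 32), ("double", 64)):
    snb = nominal_peak_gflops(element_bits=bits, **sandy_bridge)
    wsm = nominal_peak_gflops(element_bits=bits, **westmere)
    print(f"{name} precision: {snb:.1f} vs {wsm:.1f} GFLOP/s, ratio {snb / wsm:.2f}")
```

Under these assumptions the ratio is the product of three factors: the core count (8/6), the clock frequency (2.70/3.46) and the vector width (2x). Passing ops_per_cycle=1 models a code that performs only additions or only multiplications, which in this simple model reaches half of the nominal peak.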

Complete paper: Colfax_FLOPS.pdf (195 kB). This file is available only to registered users.

ADDENDUM

1. Note that pipelining effects come into play when arithmetic operations are combined in a code. For instance, better performance may be obtained when additions are alternated with multiplications than in a code that performs only additions or only multiplications. A follow-up article discusses this effect.

2. The Linpack benchmark result reported in the paper “Arithmetics on Intel’s Sandy Bridge…” was obtained using the precompiled binaries optimized for the Xeon 64-bit architecture and employing the Intel OpenMP library for shared-memory parallelization. These results are sub-optimal for this system. Running the MPI-based benchmark yielded a higher Linpack score for the dual-socket E5-2680 CPU system: 292 GFLOP/s. The key parameters of this benchmark are: 32 processes, N=39936, NB=112, PMAP=0, P=4, Q=8. Even higher scores may be possible; see Intel’s publication on this subject.
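
The Linpack parameters above can be sanity-checked with a few lines of arithmetic. The sketch below is an illustration rather than part of the benchmark. The helper names are hypothetical, the operation count uses only the leading (2/3)N^3 term of LU factorization, and the nominal peak assumes 8 double precision FLOPs per cycle per core (one 4-wide AVX addition plus one 4-wide AVX multiplication) at the 2.70 GHz base clock.

```python
def grid_matches(p, q, processes):
    """The P x Q process grid must account for every MPI process."""
    return p * q == processes


def matrix_gib(n, bytes_per_element=8):
    """Memory taken by the N x N double precision matrix, in GiB."""
    return n * n * bytes_per_element / 2**30


def approx_runtime_s(n, gflops):
    """Approximate run time from the leading (2/3) N^3 operation count."""
    return (2.0 / 3.0) * n**3 / (gflops * 1e9)


def efficiency(measured_gflops, sockets, cores, clock_ghz, flops_per_cycle=8):
    """Measured score over nominal peak (flops_per_cycle is an assumption)."""
    return measured_gflops / (sockets * cores * clock_ghz * flops_per_cycle)


# Values from the text: 32 processes, N=39936, P=4, Q=8, 292 GFLOP/s,
# dual-socket system with 8-core 2.70 GHz CPUs.
print("grid consistent:", grid_matches(4, 8, 32))
print(f"matrix size: {matrix_gib(39936):.1f} GiB")
print(f"approximate run time: {approx_runtime_s(39936, 292.0):.0f} s")
print(f"fraction of nominal peak: {efficiency(292.0, 2, 8, 2.70):.2f}")
```

The efficiency figure is only as good as the assumed peak: a different clock under load, for example with Turbo Boost active, would change the denominator.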
