Avoiding communication saves time and energy (if you are an algorithm)

In this post, I would like to reflect on a seminar that I recently attended at Stanford University’s Institute for Computational and Mathematical Engineering. The talk was given by Prof. James Demmel, who leads the research on communication avoiding algorithms at the UC Berkley Computer Science department. The lessons I took home from this talk are two: first, the research in communication avoiding algorithms has brought about amazing optimization possibilities, which reduce the time and energy usage of a number of computing problems; and second, the trend of hardware upgrades in the academic HPC arena goes in the direction that is counter-productive for these novel methods.

Why avoiding communication is important

It is common knowledge that arithmetic capabilities of computing systems progress much faster than the bandwidth and latency of computer networks and random-access memory. An explanation of this trend offered by Mark Hoemmen, a student of Demmel, is that Flops are cheap, bandwidth is money, latency is physics. The consequence of the skyrocketing arithmetic capabilities, paraphrasing Prof. Demmel’s words, is that even if one’s algorithm is not bottlenecked by the communication of data today, it will be in the near future, when the number of FLOP/s supported by available clusters has grown much more than the network throughput. And this is where communication-avoiding algorithms come in. Indeed, this area of research has born fruit in the last few years (see, e.g., Prof. Demmel’s publication list on DBLP). Even for common operations, such as matrix multiplication, LU decomposition or the N-body problem, novel algorithms can boost speed and reduce energy usage; Prof. Demmel’s talk showed examples where improvements were as large as an order of magnitude compared to traditional methods!

More memory + Novel algorithm = Less communication

At this point, it is valid to ask the question, “Ok, what’s the catch?” The answer is one word: memory. The work “Communication-Optimal Parallel 2.5D Matrix Multiplication and LU Factorization Algorithms” by Solomonik and Demmel explains it in this way:

Extra memory allows parallel matrix multiplication to be done with asymptotically less communication than Cannon’s algorithm and be faster in practice. “3D” algorithms arrange the p processors in a 3D array, and store redundant copies of the matrices on each of p layers. “2D” algorithms such as Cannon’s algorithm store a single copy of the matrices on a 2D array of processors. We generalize these 2D and 3D algorithms by introducing a new class of “2.5D algorithms”. For matrix multiplication, we take advantage of any amount of extra memory to store c copies of the data, for any c ∈ {1, 2, …, [p]}, to reduce the bandwidth cost of Cannon’s algorithm by a factor of c½ and the latency cost by a factor c3/2

According to Prof. Demmel’s presentation, storing additional copies of the data to reduce the communication cost has proven to be advantageous for a number of O(n3) and Strassen-like problems; “s-step” Krylov subspace methods; and general programs with array references where the subscripts are linear functions of the loop references (e.g., the N-body problem). The optimum performance produces nearly perfect scaling in time and energy (i.e., the computing time scales as T=p−1 and energy consumption scales as E=const).

Is the academic HPC world up to speed?

There is no doubt that communication avoiding algorithms will play a very important role in the future of HPC. In fact, these methods were mentioned in President Obama’s Energy Budget Request to Congress (see also this press-release for page numbers and quotes). However, what has the trend in HPC hardware developments been so far? After the Prof. Demmel’s talk I had a very informative discussion with another attendee of the seminar, SLAC scientist Ki Hwan Lee, who has brought to my attention that the recent upgrade of state-of-the art HOPPER cluster resulted in less memory per core than before the upgrade. At the moment, HOPPER’s 6,000 compute nodes have 32 GB and 384 nodes feature 64 GB of RAM shared between 24 processor cores.

The lack of large shared-memory computing resources available to academia is not news; however, these resources do exist elsewhere. As an example, allow me to refer to the recent Colfax Research publication, which mentions an astrophysical computing effort involving terabyte-RAM servers. These servers were provided to the Fermi Large Area Telescope researchers by Colfax International on a temporary basis, and similar resources could not be found in academic institutions. In that publication, I have already advocated for the utility of large memory compute nodes from the perspective of efficiency:

  • first, they allow in-core solution of exotic computational problems, such as the huge 1D FFTs necessary for our astrophysical project, which eliminates network traffic completely;
  • second, for memory-bound batch calculations, they allow to shift from thread-level parallelism to process-level parallelism, eliminating the multithreading overhead and thus increasing scalability;
  • and finally, they may replace several small-memory compute nodes, reducing the amount, cost and maintenance effort of cluster infrastructure.

While these large-memory machines are routinely ordered by Colfax’s customers as stand-alone compute nodes, they are generally not used as a part of an HPC solution. In fact, this technology is so alien to the scientific HPC community that I often have to repeat “a terabyte of RAM” twice in discussions with experts in scientific computing.

Conclusions

What does one take away from this information? In my opinion, with teraflop-capable GPUs overtaking the arithmetically-intensive workloads and the Intel MIC on the way for similar tasks, the logical direction in which the application of CPUs should shift is memory-intensive problems. Memory-intensive problems is something that GPUs are not good at by design. It is therefore surprising that the recent advances in hardware configuration are made in the opposite direction, shrinking the amount of memory per core. While it may take some time for big computing to assess the financial and environmental impact of large-memory computers and adjust the strategies for HPC development, today’s users and owners of smaller computing clusters can benefit from large memory machines in a number of ways, with communication avoiding algorithms being one of the most lucrative.