We describe FALCON, an original open-source implementation of image convolution with a 3×3 filter based on Winograd’s minimal filtering algorithm. Compared to direct convolution, Winograd’s algorithm reduces the number of arithmetic operations at the cost of complicating the memory access pattern. This study is carried out in the context of image analysis in convolutional neural networks.
Our implementation combines C language code with BLAS function calls for general matrix-matrix multiplication. The code is optimized for Intel Xeon Phi x200 processors (formerly Knights Landing), with the Intel Math Kernel Library (MKL) providing the BLAS SGEMM call.
To test the performance of FALCON in the context of machine learning, we benchmarked it for a set of image and filter sizes corresponding to the VGG Net architecture. In this test, FALCON achieves 10% greater overall performance than convolution from DNN primitives in Intel MKL. However, for some layers FALCON is faster than MKL by 1.5x, while for other layers it is slower by as much as 4x. This suggests the possibility of a hybrid implementation that chooses between fast and direct convolution per layer, for a 30% speedup. High-bandwidth memory (MCDRAM) in the Intel Xeon Phi x200 product family is a significant factor in the efficiency of the fast convolution algorithm.
Table of Contents
- Section 1. Convolution in Machine Learning
- Section 2. Winograd's Minimal FIR Filtering
- Section 3. Application to ConvNets
- Section 4. Transformation to GEMM
- Section 5. Implementation in Code
- Section 6. The FALCON Library
- Section 7. Performance
- Section 8. Conclusion
Section 1. Convolution in Machine Learning
Applications of deep neural networks (DNNs) to machine learning are diverse and rapidly emerging, reaching the fields of assistive technologies, commerce, fundamental sciences, medical imaging and security (see, e.g., this article). DNNs thrive on abundant data. As a consequence, training DNNs often requires long, expensive development time and powerful computing resources. Therefore, even small improvements in the efficiency of the fundamental building blocks of DNNs can benefit the field of machine learning.
In image analysis with DNNs, one building block has gained particular importance in recent years: the operation of convolution of images with a filter. This operation is used in convolutional DNNs (ConvNets), which rely on the mathematical operation of convolution for position-independent object identification in images (see this article).
Numerically, convolution may be performed directly. This method is expensive in terms of computational complexity: for an image of size H×W and a filter of size R×S, direct convolution requires O(HW·RS) operations. However, the arithmetic operations in direct convolution can easily be collapsed to form the general matrix-matrix multiplication (GEMM) pattern (see this article). This simplifies the design of convolution functions because the complexity of memory and cache traffic management is delegated to the implementation of GEMM. Efficient GEMM code exists in Basic Linear Algebra Subroutine (BLAS) libraries for nearly every computer architecture. In the case of Intel architecture, the Intel Math Kernel Library (MKL) has highly efficient implementations of GEMM and of direct convolution expressed with matrix-matrix multiplication.
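To make the reduction to GEMM concrete, here is a minimal C sketch for one channel and one 3×3 filter (the function names and dimensions are ours, chosen for illustration; FALCON's actual code differs). Each 3×3 input patch is unrolled into a column ("im2col"), after which the whole output map is one matrix product; with F filters stacked as an F×9 matrix, this becomes the GEMM handled by a BLAS library.

```c
#include <assert.h>

/* Illustrative dimensions: a 4x4 image, valid 3x3 convolution -> 2x2 output. */
enum { HI = 4, WI = 4, HO = HI - 2, WO = WI - 2 };

/* Unroll every 3x3 input patch into one column of a 9 x (HO*WO) matrix. */
static void im2col3x3(float img[HI][WI], float cols[9][HO*WO]) {
    for (int u = 0; u < 3; u++)
        for (int v = 0; v < 3; v++)
            for (int i = 0; i < HO; i++)
                for (int j = 0; j < WO; j++)
                    cols[u*3 + v][i*WO + j] = img[i + u][j + v];
}

/* One filter (a 1x9 row) times the 9 x (HO*WO) column matrix gives the whole
   output map; with many filters this is exactly the GEMM pattern. */
static void conv_as_gemm(const float g[9], float cols[9][HO*WO], float out[HO*WO]) {
    for (int t = 0; t < HO*WO; t++) {
        float s = 0.0f;
        for (int k = 0; k < 9; k++) s += g[k] * cols[k][t];
        out[t] = s;
    }
}
```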
At the same time, it is possible to compute convolution with alternative methods that perform fewer arithmetic operations than the direct method. For example, the fast Fourier transform (FFT) may be used to compute image convolution with complexity O(HW·log(HW)) (see this book). The asymptotic behavior of this algorithm predicts fewer operations than in the direct method only if the filter is large enough: RS ≳ log(HW). However, this approach is not useful for ConvNets because they typically use filters as small as 2×2 or 3×3 pixels. In this range, the performance of the FFT method is poor compared to the direct method. In the domain of small filters, Winograd’s minimal filtering algorithm may be a better choice (see 1, 2). This approach has the same asymptotic complexity as the direct method, O(HW·RS), but it reduces the number of operations by a constant factor. In this paper we present an implementation of convolution based on Winograd’s minimal filtering algorithm for a filter size of 3×3.
From here on, we refer to the convolution algorithm based on Winograd’s minimal filtering as “fast convolution”. This term, chosen by analogy with the fast Fourier transform, signifies that the algorithm performs fewer floating-point operations than the direct approach. At the same time, it is not trivial to implement a computer program performing “fast” convolution in less time than the direct method, because the fast algorithm requires data transformations with a complex memory access pattern, making it more difficult to express efficiently in code. The choice between an expensive yet simple algorithm (direct convolution) and a less expensive but complicated algorithm (fast convolution) is therefore not straightforward. It is difficult to predict, for instance, how well the hierarchy of memory and caches is able to serve the complex data access pattern of an algorithm based on strided or indirect memory access.
Section 2. Winograd’s Minimal FIR Filtering
The original Winograd’s algorithm is applied to the computation of finite impulse response (FIR) filters. Direct application of 2 consecutive steps of a 3-tap FIR filter with coefficients g0, g1, g2 to a set of 4 elements d0, d1, d2, d3 requires 6 additions and 6 multiplications:

F0 = g0·d0 + g1·d1 + g2·d2,
F1 = g0·d1 + g1·d2 + g2·d3.
The idea of Winograd’s method is to compute these two filter outputs as

m1 = (d0 - d2)·g0,
m2 = (d1 + d2)·(g0 + g1 + g2)/2,
m3 = (d2 - d1)·(g0 - g1 + g2)/2,
m4 = (d1 - d3)·g2,

F0 = m1 + m2 + m3,
F1 = m2 - m3 - m4.
If we precompute the expressions (g0 + g1 + g2)/2 and (g0 - g1 + g2)/2, then this procedure requires 8 additions and 4 multiplications, which is equal to the number of floating-point operations in the direct method. However, if our goal is to apply multiple filters gi to the same data di, then we can also precompute d0 - d2, d1 + d2, d2 - d1 and d1 - d3. With this done, the computation of F0 and F1 requires only 4 additions and 4 multiplications, yielding a speedup of (6+6)/(4+4) = 1.5.
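The identities above can be checked with a short C sketch (illustrative, not part of FALCON). The factors of 1/2 fold into the precomputed filter-side terms when one filter is applied to many inputs:

```c
#include <assert.h>
#include <math.h>

/* Direct 3-tap FIR filter: two consecutive outputs from four inputs. */
static void fir3_direct(const float d[4], const float g[3], float F[2]) {
    F[0] = g[0]*d[0] + g[1]*d[1] + g[2]*d[2];
    F[1] = g[0]*d[1] + g[1]*d[2] + g[2]*d[3];
}

/* Winograd F(2,3): the same two outputs with only 4 multiplications.
   The terms 0.5f*(g0+g1+g2) and 0.5f*(g0-g1+g2) would be precomputed
   once per filter in a real implementation. */
static void fir3_winograd(const float d[4], const float g[3], float F[2]) {
    float m1 = (d[0] - d[2]) * g[0];
    float m2 = (d[1] + d[2]) * 0.5f*(g[0] + g[1] + g[2]);
    float m3 = (d[2] - d[1]) * 0.5f*(g[0] - g[1] + g[2]);
    float m4 = (d[1] - d[3]) * g[2];
    F[0] = m1 + m2 + m3;
    F[1] = m2 - m3 - m4;
}
```

With the filter-side terms precomputed, only the four data-side additions and four multiplications remain in the inner loop, which is where the 1.5x operation-count reduction comes from.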
Section 3. Application to ConvNets
In the context of ConvNets, the operation of convolution applies a total of F filters of size 3×3 to a batch of N images of size H×W with C channels in each. We enumerate filters with f = 1, ..., F, and channels with c = 1, ..., C. Each image we split into overlapping 4×4 tiles (adjacent tiles share 2 pixels in each dimension, so that each tile produces an independent 2×2 output tile), and we enumerate these tiles within the image with t, which ranges from 1 to T, the number of tiles per image. The images within a batch are enumerated with n, which ranges from 1 to N.
Direct convolution of a 3×3 filter with a 4×4 image tile to generate a 2×2 output tile requires 36 multiplications and 32 additions. As shown by Lavin and Gray, Winograd’s fast FIR filter computation can be generalized to 2D filters, which are mathematically similar to convolution. The fast method can be expressed as

Y = Aᵀ [ (G g Gᵀ) ⊙ (Bᵀ d B) ] A,
where d is the 4×4 matrix representing the image tile, g is the 3×3 matrix representing channel c of filter f, Y is the 2×2 matrix with the output of the convolution of d with g, and Bᵀ, G and Aᵀ are the constant transformation matrices

Bᵀ = [ 1  0 -1  0 ]    G = [  1    0    0  ]    Aᵀ = [ 1  1  1  0 ]
     [ 0  1  1  0 ]        [ 1/2  1/2  1/2 ]         [ 0  1 -1 -1 ]
     [ 0 -1  1  0 ]        [ 1/2 -1/2  1/2 ]
     [ 0  1  0 -1 ]        [  0    0    1  ]
In the above equation, the symbol ⊙ indicates element-wise multiplication. Assuming that we have many image tiles and multiple filters, we can precompute the transformed image tiles V = Bᵀ d B and the transformed filters U = G g Gᵀ. With that done, the algorithm requires 16 multiplications between the transformed input and filters. For the inverse transformation, 24 additions are required. This is already (36+32)/(16+24) = 1.7 times fewer operations than in the direct method. In addition to this, for the purposes of ConvNets, convolution must be applied to C image channels, and the results for individual channels must be summed:

Y = Σc Aᵀ [ Uc ⊙ Vc ] A.
The linearity of the inverse transformation allows one to perform the summation over channels first (16 additions per tile per channel) and apply the inverse transformation after the summation,

Y = Aᵀ [ Σc Uc ⊙ Vc ] A,
thereby making its contribution to the operation count negligible. This results in net savings in the number of operations by a factor of (36+32+4)/(16+16) ≈ 2.25 (we factored in the need to do 4 additions per channel to sum the outputs in the direct method).
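A minimal single-channel C sketch of the F(2×2,3×3) computation, using the standard transform matrices of Lavin and Gray, makes the procedure concrete. The helper names and the use of a generic small matmul are ours; a production implementation would hard-code the transforms as additions rather than general matrix products:

```c
#include <assert.h>
#include <math.h>

/* Small row-major matrix product: C[m][n] = A[m][k] * B[k][n]. */
static void matmul(int m, int k, int n, const float *A, const float *B, float *C) {
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++) {
            float s = 0.0f;
            for (int p = 0; p < k; p++) s += A[i*k + p] * B[p*n + j];
            C[i*n + j] = s;
        }
}

/* Direct convolution of a 3x3 filter g with a 4x4 tile d -> 2x2 output Y. */
static void conv3x3_direct(const float d[16], const float g[9], float Y[4]) {
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++) {
            float s = 0.0f;
            for (int u = 0; u < 3; u++)
                for (int v = 0; v < 3; v++)
                    s += d[(i + u)*4 + (j + v)] * g[u*3 + v];
            Y[i*2 + j] = s;
        }
}

/* Fast F(2x2,3x3) convolution: Y = At [(G g Gt) .* (Bt d B)] A,
   with the standard transform matrices of Lavin and Gray. */
static void conv3x3_winograd(const float d[16], const float g[9], float Y[4]) {
    static const float Bt[16] = { 1, 0,-1, 0,   0, 1, 1, 0,
                                  0,-1, 1, 0,   0, 1, 0,-1 };
    static const float B [16] = { 1, 0, 0, 0,   0, 1,-1, 1,
                                 -1, 1, 1, 0,   0, 0, 0,-1 };
    static const float G [12] = { 1.0f, 0.0f, 0.0f,
                                  0.5f, 0.5f, 0.5f,
                                  0.5f,-0.5f, 0.5f,
                                  0.0f, 0.0f, 1.0f };
    static const float Gt[12] = { 1.0f, 0.5f, 0.5f, 0.0f,
                                  0.0f, 0.5f,-0.5f, 0.0f,
                                  0.0f, 0.5f, 0.5f, 1.0f };
    static const float At[ 8] = { 1, 1, 1, 0,   0, 1,-1,-1 };
    static const float A [ 8] = { 1, 0,   1, 1,   1,-1,   0,-1 };
    float t43[12], U[16], t44[16], V[16], M[16], t24[8];
    matmul(4, 3, 3, G,   g,  t43);  /* filter transform U = G g Gt  */
    matmul(4, 3, 4, t43, Gt, U);
    matmul(4, 4, 4, Bt,  d,  t44);  /* data transform   V = Bt d B  */
    matmul(4, 4, 4, t44, B,  V);
    for (int i = 0; i < 16; i++) M[i] = U[i] * V[i];  /* 16 multiplications */
    matmul(2, 4, 4, At,  M,  t24);  /* inverse transform Y = At M A */
    matmul(2, 4, 2, t24, A,  Y);
}
```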
The expectation of speedup in the fast algorithm due to the reduced number of operations hinges on the assumption that the precomputation of the forward and backward transformations of the data takes little time. In reality, our experiments revealed that a straightforward implementation of the above algorithm does not provide high performance, and platform-specific optimization is required.
Section 4. Transformation to GEMM
In the above reasoning, we made a silent assumption that arithmetic throughput is the limiting factor of performance, i.e., that memory traffic completely overlaps with computation. This assumption holds true only if the order of tile convolutions is tuned to effectively reuse data in the processor’s caches. To avoid this complexity, as shown by Lavin and Gray, we can express the arithmetic operations in the equations of the previous section as matrix multiplication, which allows us to delegate the complexity of overlapping memory traffic with computation to the GEMM function of a BLAS library.
Expressing the calculations through GEMM is possible because a transformed input tile can be reused to multiply with multiple corresponding filter tiles, and, similarly, a transformed filter tile can be reused to multiply with the corresponding input tiles across all batches. In addition, the input and filter tiles that are multiplied across channels are accumulated into one output tile. Denoting the elements of the transformed filter U(c,f) = G g(c,f) Gᵀ as u(x,y; c,f), and the elements of the transformed input tile V(c,t,n) = Bᵀ d(c,t,n) B as v(x,y; c,t,n), we can define a matrix M with elements

m(x,y; f,t,n) = Σc u(x,y; c,f) · v(x,y; c,t,n).
For each pair (x, y), this equation expresses the multiplication of an F×C matrix by a C×(T·N) matrix. The final result of the convolution can then be written as

Y(f,t,n) = Aᵀ M(f,t,n) A,

where M(f,t,n) is the 4×4 matrix with elements m(x,y; f,t,n).
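The GEMM stage can be sketched as follows. In FALCON each of the 16 independent products is an MKL SGEMM call; here a plain triple loop stands in for it, and the dimensions F, C and TN are illustrative values of our choosing:

```c
#include <assert.h>

/* One GEMM per tile position: for each of the 16 positions (x,y) of a 4x4
   transformed tile, the channel sum m(f,t) = sum_c u(c,f) v(c,t) is the
   product of an F-by-C matrix with a C-by-(T*N) matrix. */
enum { POS = 16, F = 2, C = 3, TN = 4 };  /* illustrative dimensions */

static void winograd_gemm_stage(float U[POS][F][C],
                                float V[POS][C][TN],
                                float M[POS][F][TN]) {
    for (int p = 0; p < POS; p++)      /* one independent GEMM per position */
        for (int f = 0; f < F; f++)
            for (int t = 0; t < TN; t++) {
                float s = 0.0f;
                for (int c = 0; c < C; c++)  /* accumulation over channels */
                    s += U[p][f][c] * V[p][c][t];
                M[p][f][t] = s;
            }
}
```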
Section 5. Implementation in Code
To compute convolution with the fast algorithm, we follow an approach similar to that shown by Lavin and Gray. There are three stages in this algorithm:
- Input transformation: scattering the image and filter data sets to form the input matrices,
- Product computation: multiplication of the transformed data and filters with summation over channels, expressed as GEMM, and
- Output transformation: gathering the elements from the product matrices and transforming them to form the actual output of the convolution.
Data transformation and the procedure for expressing the computation with GEMM are explained thoroughly by Lavin and Gray. We have modified the data layout and the way the input matrices for GEMM are formed; the pseudocode shown below illustrates the procedure.
The input data format is flexible and can be tuned with the help of a merge factor to achieve high GEMM performance: different values of the merge factor produce different layouts, shuffling the outer dimensions while keeping fixed inner dimensions.
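For illustration only (the layout and names here are ours, not FALCON's internal format), the following sketch shows the scattering of the first stage for one channel of one image: overlapping 4×4 input tiles are gathered with stride 2, the unit consumed by the F(2×2,3×3) transform:

```c
#include <assert.h>

/* Scatter an H x W single-channel image into overlapping 4x4 input tiles
   with stride 2. Each tile later yields one 2x2 output tile, so a valid
   (H-2) x (W-2) output corresponds to a TH x TW grid of tiles. */
enum { H = 8, W = 8, TH = (H - 2) / 2, TW = (W - 2) / 2 };

static void scatter_tiles(float img[H][W], float tiles[TH * TW][16]) {
    for (int ti = 0; ti < TH; ti++)
        for (int tj = 0; tj < TW; tj++)
            for (int u = 0; u < 4; u++)
                for (int v = 0; v < 4; v++)
                    tiles[ti*TW + tj][u*4 + v] = img[2*ti + u][2*tj + v];
}
```

Because adjacent tiles overlap by 2 pixels, this stage reads most input elements several times; how these reads map onto caches and MCDRAM is a large part of why the scattering stage needs platform-specific tuning.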