The use of high-performance libraries for dense linear algebra operations is of great importance in many numerical scientific applications. The most common operations form the backbone of the Basic Linear Algebra Subroutines (BLAS) library. In this paper, we consider the performance and auto-tuning of level 1 and level 2 BLAS routines on graphics processing units (GPUs). As examples, we develop single-precision Compute Unified Device Architecture (CUDA) kernels for three of the most popular operations: the Euclidean norm (SNRM2), the matrix–vector multiplication (SGEMV), and the triangular solve (STRSV). The target hardware is the most recent Nvidia Tesla 20-series (Fermi architecture), which is designed from the ground up for high-performance computing. We show that achieving high performance for level 1 and level 2 BLAS operations is essentially a matter of fully utilizing the fine-grained parallelism of the many-core GPU. We also show that auto-tuning can be successfully applied to these kernels so that they perform well for all input sizes.
In this work, we have implemented level 1 and level 2 BLAS operations as high-performance GPU kernels designed for auto-tuning. We used auto-tuning to select the optimal kernel parameters on the Tesla C2050 card (Fermi GPU). The auto-tuning consisted of an exhaustive search of a tuning parameter space containing key hardware-dependent parameters that set the number of threads per block, the work per thread, the unroll level of the inner-most loop, and algorithmic trade-offs. A sketch of how these parameters enter a kernel is given below.
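To make the parameter space concrete, the following minimal CUDA sketch (not one of the tuned kernels from this work) shows how the threads-per-block, work-per-thread, and unroll-level parameters can be exposed as compile-time template arguments of a simple vector-scaling kernel. The names BLOCK_SIZE, WORK_PER_THREAD, and UNROLL are illustrative, and the pragma form assumes an nvcc that accepts an integral constant expression:

    // Tuning parameters as compile-time template arguments (illustrative).
    template <int BLOCK_SIZE, int WORK_PER_THREAD, int UNROLL>
    __global__ void scale_kernel(int n, float alpha, const float* x, float* y)
    {
        // Each block processes BLOCK_SIZE * WORK_PER_THREAD elements.
        int base = blockIdx.x * BLOCK_SIZE * WORK_PER_THREAD + threadIdx.x;

        #pragma unroll UNROLL
        for (int i = 0; i < WORK_PER_THREAD; ++i) {
            int idx = base + i * BLOCK_SIZE;
            if (idx < n)
                y[idx] = alpha * x[idx];  // one element of y = alpha * x
        }
    }

A launch such as scale_kernel<128, 4, 4><<<grid, 128>>>(n, alpha, x, y) then fixes one point in the tuning space, and the compiler can unroll and schedule the inner loop for exactly that configuration.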
We implemented our auto-tuner and kernels using C++ function templates, which are supported in the CUDA programming model and allow for meta-programming of kernels, where the tuning arguments are evaluated at compile time. The proposed auto-tuning procedure then required at most 512 different configurations of a particular kernel to be compiled and benchmarked for a given input or on a tuning grid. Furthermore, we used empirical tests and knowledge about the hardware to prune the parameter space before the auto-tuning, which effectively reduced the auto-tuning runtime to a matter of minutes for all the level 1 and level 2 BLAS kernels.
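The host-side search can be organized as in the following hedged sketch, which reuses the scale_kernel template from the sketch above. Because the tuning arguments are template parameters, every candidate configuration must be instantiated explicitly, here through a small table of launcher functions; the candidate set and timing scheme are illustrative, not those used in this work:

    #include <cstdio>
    #include <cuda_runtime.h>

    typedef void (*KernelLauncher)(int, float, const float*, float*);

    struct Config { int block, work; KernelLauncher launch; };

    // One explicit instantiation per candidate configuration.
    template <int BLOCK, int WORK>
    void launch_scale(int n, float alpha, const float* x, float* y)
    {
        int grid = (n + BLOCK * WORK - 1) / (BLOCK * WORK);
        scale_kernel<BLOCK, WORK, WORK><<<grid, BLOCK>>>(n, alpha, x, y);
    }

    // A pruned candidate set: block sizes that are multiples of the warp
    // size and modest work-per-thread values, as discussed in the text.
    static const Config candidates[] = {
        { 64, 1, launch_scale< 64, 1>}, { 64, 4, launch_scale< 64, 4>},
        {128, 2, launch_scale<128, 2>}, {256, 1, launch_scale<256, 1>},
        {256, 4, launch_scale<256, 4>},
    };

    // Time one configuration with CUDA events; returns milliseconds.
    float time_config(const Config& c, int n, float alpha, float* x, float* y)
    {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        c.launch(n, alpha, x, y);              // warm-up run
        cudaEventRecord(start);
        c.launch(n, alpha, x, y);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms;
    }

For each input size on the tuning grid, the auto-tuner simply selects the candidate that minimizes the measured time.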
We have illustrated the approach for the level 1 BLAS routines with the example of the Euclidean norm (SNRM2), and for the level 2 BLAS routines in the cases of the matrix–vector multiplication (SGEMV) and the triangular solve with one right-hand side (STRSV). In all cases, we achieved significantly better performance than the corresponding routines in the CUBLAS 4.0 library.
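As an indication of the kernel structure in the level 1 case, the following is a textbook shared-memory reduction for the sum of squares behind SNRM2; it is a sketch only, not the tuned kernel, and unlike the reference BLAS implementation it does not guard against overflow by scaling:

    // Each block writes one partial sum of squares to partial[blockIdx.x].
    template <int BLOCK_SIZE>  // BLOCK_SIZE is assumed to be a power of two
    __global__ void snrm2_partial(int n, const float* x, float* partial)
    {
        __shared__ float sdata[BLOCK_SIZE];
        int tid = threadIdx.x;
        float sum = 0.0f;

        // Grid-stride loop: each thread accumulates a private sum of squares.
        for (int i = blockIdx.x * BLOCK_SIZE + tid; i < n;
             i += gridDim.x * BLOCK_SIZE)
            sum += x[i] * x[i];

        sdata[tid] = sum;
        __syncthreads();

        // Tree reduction in shared memory; the block's sum ends in sdata[0].
        for (int s = BLOCK_SIZE / 2; s > 0; s >>= 1) {
            if (tid < s)
                sdata[tid] += sdata[tid + s];
            __syncthreads();
        }
        if (tid == 0)
            partial[blockIdx.x] = sdata[0];
    }

The host (or a second small kernel) then sums the per-block partial results and applies sqrtf to obtain the norm.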
To evaluate the performance of level 1 and level 2 BLAS kernels, which are inherently memory bound, we proposed measuring the effective bandwidth for a given input size; this is also the best metric for estimating the maximum attainable performance. In practice, the estimates showed that the SNRM2 and SGEMV kernels come relatively close to this upper bound, whereas the STRSV kernel remains far from it. In other words, there is still room for improvement in the development of parallel GPU algorithms for triangular solvers.
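For reference, the effective-bandwidth metric amounts to dividing the bytes a kernel must move by its measured runtime and comparing the result with the device's peak memory bandwidth. A small sketch for single-precision SGEMV (y = alpha*A*x + beta*y with A of size m-by-n); the timing value in the example is purely illustrative:

    #include <cstdio>

    // Effective bandwidth in GB/s for SGEMV: A and x are read once,
    // y is read and written once.
    double sgemv_effective_gbps(int m, int n, double seconds)
    {
        double bytes = (double)sizeof(float) *
                       ((double)m * n   // read A
                        + n             // read x
                        + 2.0 * m);     // read and write y
        return bytes / seconds * 1e-9;
    }

    int main()
    {
        // Illustrative numbers: a 4096 x 4096 SGEMV measured at 0.48 ms
        // corresponds to ~140 GB/s, to be compared against the Tesla
        // C2050's theoretical peak of about 144 GB/s.
        printf("%.1f GB/s\n", sgemv_effective_gbps(4096, 4096, 0.48e-3));
        return 0;
    }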