The CUDA model for graphics processing units (GPUs) presents the programmer with a plethora of different programming options. These include different memory types, different memory access methods and different data types. Identifying which options to use and when is a non-trivial exercise. This paper explores the effect of these different options on the performance of a routine that evaluates sparse matrix–vector products (SpMV) across three different generations of NVIDIA GPU hardware. A process for analysing performance and selecting the subset of implementations that perform best is proposed. The potential for mapping sparse matrix attributes to optimal CUDA SpMV implementations is discussed. Copyright © 2011 John Wiley & Sons, Ltd.
In this work we have taken a relatively simple piece of code for performing SpMV and shown how it can give rise to a large number of different implementations when using CUDA on a GPU. Moreover, the variation in performance between these different implementations was found to be quite large. Although both the number of implementations and the number of test matrices were large, the generation of the implementations and the performance testing are operations that are amenable to automation.
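The underlying operation that every implementation variant computes is the same. A minimal sketch of a sparse matrix–vector product over the compressed sparse row (CSR) format is given below in plain Python for clarity; the paper's implementations compute this same result with CUDA kernels, and the array names here are illustrative rather than the paper's actual identifiers.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Reference CSR sparse matrix-vector product y = A*x.

    row_ptr[i]..row_ptr[i+1] delimits the non-zeros of row i;
    col_idx and values hold their column indices and values.
    """
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        for j in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[j] * x[col_idx[j]]
        y.append(acc)
    return y
```

The many CUDA variants differ in how this loop nest is mapped onto threads and which memory types hold each array, not in the arithmetic itself.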
From the large number of implementations and the extensive performance testing we introduced the concept of a BPS, defined incrementally using a greedy-type algorithm. It is important to note that, for any size larger than 1 and smaller than the maximum, the BPS may not contain the routines that give the best possible performance. That is, BPS 2 must contain the routine in BPS 1 and is not free to choose any two routines, even if that might give better overall performance. This approach was taken in order to avoid the combinatorial blow-up in the number of subsets that would have to be evaluated for the large BPS sizes.
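The incremental, greedy construction described above can be sketched as follows. This is an illustrative reconstruction, not the paper's code: it assumes `perf[m][r]` holds the measured performance of routine `r` on test matrix `m`, and scores a candidate subset by the best achievable performance summed over all matrices. Because each step only extends the previous subset, the BPS of size k always contains the BPS of size k-1.

```python
def build_bps(perf, routines, max_size):
    """Greedily grow a BPS one routine at a time.

    perf: {matrix: {routine: measured GFLOP/s}} (illustrative layout).
    At each step, add the routine that most improves the total of the
    best achievable performance across all test matrices.
    """
    bps = []

    def coverage(subset):
        # Sum over matrices of the best performance any routine
        # in the subset achieves on that matrix.
        return sum(max(perf[m][r] for r in subset) for m in perf)

    for _ in range(max_size):
        best = max((r for r in routines if r not in bps),
                   key=lambda r: coverage(bps + [r]))
        bps.append(best)
    return bps
```

Note that evaluating all subsets of size k instead would require examining a number of combinations that grows combinatorially with k, which is exactly what the incremental construction avoids.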
In this work the performance of one implementation was considered to be better than that of another as soon as the measured performance was greater; there was no concept of performance being significantly better. Attempting to group implementations that give roughly the same performance for a given matrix may provide a means of reducing the number of implementations that need to be evaluated.
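One possible way to introduce a notion of "significantly better" is to require a candidate to beat the incumbent by more than a relative tolerance before it displaces it. The 5% threshold below is an illustrative choice, not a value from the paper.

```python
def significantly_better(candidate_gflops, incumbent_gflops, tol=0.05):
    """Return True only if the candidate exceeds the incumbent by
    more than the relative tolerance (5% by default, illustrative)."""
    return candidate_gflops > incumbent_gflops * (1.0 + tol)
```

Implementations whose measurements fall within the tolerance band would then be treated as equivalent, shrinking the effective set of distinct implementations.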
The sensitivity of the BPS to the nature of the matrices used to construct it was explored. The results showed some differences depending on the size of the sparse matrices used in the test suite and their uniformity, but the relationship was not strong.
From the BPS we attempted to construct a simple decision tree that maps matrix attributes to an optimal implementation. The approach used here was crude, and based on a minimal number of matrix attributes. The results, however, show some promise. The use of more elaborate matrix attributes (such as the average number of neighbours) may improve this mapping. In this respect, further investigation of the sensitivity of the BPS to the characteristics of the test matrices and the capabilities of the underlying hardware may be advantageous, as might be the use of machine learning techniques to provide a more sophisticated mapping function. It may also be possible to use reinforcement learning techniques that do not rely on pre-learning a set of rules to develop a BPS that can dynamically adapt to changes in runtime conditions.
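In the spirit of the simple decision tree described above, such a mapping might look like the toy rule below. The attribute thresholds and the implementation labels are invented for illustration; a real tree would be derived from the BPS and the measured data.

```python
def choose_implementation(n_rows, nnz):
    """Toy decision rule mapping basic matrix attributes to an
    implementation label. Thresholds and labels are illustrative."""
    avg_nnz_per_row = nnz / n_rows
    if avg_nnz_per_row < 8:
        return "scalar-kernel"   # e.g. one thread per row
    elif avg_nnz_per_row < 64:
        return "vector-kernel"   # e.g. one warp per row
    else:
        return "blocked-kernel"  # e.g. several warps per row
```

A machine-learned mapping would replace the hand-chosen thresholds with splits fitted to the performance measurements.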
In terms of performance, the GFLOP/s reported here may seem a little low given the high theoretical peak of the GPU. It should be remembered that the GFLOP/s given are effective GFLOP rates that involve converting the vec array from double to float and some transfer of data between the GPU and main memory. They also relate only to useful floating point operations, not to additional operations that might be performed because of padding of the various vectors. It should also be noted that the performance given for the GTX 295 used just one of the two GPUs available on this card.
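Counting only useful floating point operations, SpMV performs one multiply and one add per stored non-zero, so the effective rate follows from 2 x nnz and the measured wall time. The helper name below is illustrative.

```python
def effective_gflops(nnz, elapsed_seconds):
    """Effective GFLOP/s for SpMV: 2 useful flops (one multiply,
    one add) per stored non-zero, ignoring any padding operations."""
    return 2.0 * nnz / elapsed_seconds / 1e9
```

Because padded zeros and host–device transfers contribute time but no useful flops, this effective figure necessarily sits well below the theoretical peak of the device.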
Finally, the goal of the work presented here is to investigate strategies for automating the development of an optimized SpMV routine for a GPU system given the large number of implementation
options available and the rapid evolution of GPU hardware. Comparing the performance of those optimized routines running on a GPU with equivalent optimized routines running on a contemporary
multicore CPU is another matter that is beyond the scope of this paper. It is interesting to note, however, that the recent work of Lee et al. suggests that the performance advantage of GPUs for some applications may not be as large as many people believe.
Ahmed H. El Zein and Alistair P. Rendell. Generating optimal CUDA sparse matrix–vector product implementations for evolving GPU hardware. Concurrency and Computation: Practice and Experience, Volume 24, Issue 1, pages 3–13, January 2012. doi: 10.1002/cpe.1732