We present a comparison of several modern C++ libraries that provide high-level interfaces for programming multi- and many-core architectures on top of CUDA or OpenCL. The comparison focuses on the solution of ordinary differential equations (ODEs) and is based on odeint, a framework for solving systems of ODEs. Odeint is designed in a very flexible way and can easily be adapted for effective use of libraries such as Thrust, MTL4, VexCL, or ViennaCL, using either CUDA or OpenCL. We found that CUDA and OpenCL work equally well for large problems, while OpenCL incurs a higher overhead for smaller ones. Furthermore, we show that modern high-level libraries make it possible to exploit the computational resources of many-core GPUs or multi-core CPUs without much knowledge of the underlying technologies.
Performance-wise, there is almost no difference between the various platforms and libraries when they run on the same hardware at large problem sizes. As we have shown, many computational problems can be solved effectively, in terms of both human and machine time, with the help of modern high-level libraries. Hence, differences in the programming interfaces of the libraries are more likely than raw performance to determine the choice of a particular library for a specific application.
The focus of Thrust is more on providing low-level primitives with an interface very close to that of the C++ STL. Special-purpose functionality is available via separate libraries such as CUSPARSE and can be integrated with little effort. The other libraries we examined provide a more convenient interface for a scientific programmer than a direct implementation in CUDA or OpenCL. MTL4 and VexCL offer a richer set of element-wise vector operations and allow for the shortest implementations of the ODEs considered in this work. ViennaCL required a few additional lines of code, including a small custom OpenCL kernel, for one of our examples. Still, this extra effort is acceptable considering that the library's focus is on sparse linear system solvers, which are beyond the scope of this paper.
Regarding CUDA versus OpenCL, the main difference observed in this work is the wider range of hardware supported by OpenCL. Although the performance obtained via CUDA is overall a few percent better than that of OpenCL, the differences are mostly too small to justify choosing CUDA on performance grounds alone. Moreover, the slight performance advantage of CUDA can turn into a disadvantage once the larger set of hardware supporting OpenCL is taken into account.
Another aspect not studied in this work is the ability to generate kernels optimized for the problem at hand at runtime. This allows, for example, generating kernels optimized for a given set of parameters, eliminating otherwise spurious reads from global memory. An in-depth study of this approach is, however, left for future work.