We report on our implementation of the RHMC algorithm for the simulation of lattice QCD with two staggered flavors on Graphics Processing Units, using the NVIDIA CUDA programming language. The main feature of our code is that the GPU is not used merely as an accelerator: the whole Molecular Dynamics trajectory is performed on it. After pointing out the main bottlenecks and how to circumvent them, we discuss the performance obtained. We also present preliminary results on OpenCL and multi-GPU extensions of our code and discuss future perspectives.
The extremely high computational capability of modern GPUs makes them attractive platforms for high-performance computing. Previous studies of lattice QCD applications have been devoted almost exclusively to the Dirac matrix inversion problem. We have shown that GPUs can be used to perform a complete simulation efficiently, without the need to rely on more traditional architectures: in this case the GPU is not just an accelerator, but the actual computer.
Our strategy has therefore been to move as much of the computation as possible onto the GPU, leaving the CPU only light control tasks: in particular, the whole molecular dynamics (MD) evolution of gauge fields and momenta, which is the most expensive part of the Hybrid Monte Carlo algorithm, runs entirely on the GPU, thus reducing costly CPU-GPU communication to a minimum.
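To make the idea concrete, the following sketch shows the structure of an MD trajectory in which all fields remain resident in device memory, so that no CPU-GPU transfer occurs inside the integration loop. This is a minimal illustration only: the kernel names (compute_force, update_momenta, update_links), the simplified leapfrog layout, and the toy one-real-number-per-site fields are assumptions made for the sketch, not the actual interface of our code.

#include <cuda_runtime.h>

// Toy degrees of freedom: one real number per site stands in for the
// SU(3) link and momentum variables of the real code.
__global__ void compute_force(const float *links, float *force, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) force[i] = -links[i];            // placeholder force term
}
__global__ void update_momenta(float *mom, const float *force, float eps, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) mom[i] += eps * force[i];
}
__global__ void update_links(float *links, const float *mom, float eps, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) links[i] += eps * mom[i];
}

void md_trajectory(float *links, float *mom, float *force,
                   int n, int n_steps, float eps) {
    int block = 128, grid = (n + block - 1) / block;
    // All fields live in device memory for the whole trajectory:
    // no CPU-GPU transfer takes place inside this loop.
    for (int s = 0; s < n_steps; ++s) {
        compute_force<<<grid, block>>>(links, force, n);
        update_momenta<<<grid, block>>>(mom, force, eps, n);
        update_links<<<grid, block>>>(links, mom, eps, n);
    }
    cudaDeviceSynchronize();
    // Only the few scalars needed for the Metropolis accept/reject step
    // would be copied back to the host after the trajectory.
}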
Following this strategy, we have developed a single-GPU code based on CUDA and tested it on the C1060 and C2050 architectures. We have been able to reach boost factors of up to ∼ 10² with respect to an equivalent traditional C++ code running on a single CPU core. Our code is currently in use to study the properties of strong interactions at finite temperature and density and the nature of the deconfinement transition; in particular, first production results have been reported in Ref. .
A point worth noting is that our implementation must rely entirely on the reliability of the GPU. If the GPU is used only for the Dirac matrix inversion, the result can be checked directly on the CPU without introducing significant overhead in the computation. Such a simple test cannot be performed if the GPU carries out a complete MD trajectory. For this reason it was mandatory to use GPUs of the Tesla series.
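For comparison, the cheap host-side check available in the accelerator-only approach amounts to recomputing the residual of the solution of D x = b with a single matrix-vector product on the CPU. The sketch below is purely illustrative: apply_dirac_cpu is a hypothetical stand-in (here a trivial diagonal operator) for the actual staggered Dirac operator.

#include <cmath>
#include <vector>

// Hypothetical placeholder: a trivial diagonal operator stands in for the
// host-side application of the staggered Dirac matrix D.
void apply_dirac_cpu(const std::vector<double> &x, std::vector<double> &Dx) {
    for (size_t i = 0; i < x.size(); ++i) Dx[i] = 2.0 * x[i];
}

// Verify a GPU solution x of D x = b with one CPU matrix-vector product:
// accept it if the relative residual |b - D x| / |b| is below tol.
bool check_inversion(const std::vector<double> &x,
                     const std::vector<double> &b, double tol) {
    std::vector<double> Dx(b.size());
    apply_dirac_cpu(x, Dx);
    double r2 = 0.0, b2 = 0.0;
    for (size_t i = 0; i < b.size(); ++i) {
        double r = b[i] - Dx[i];
        r2 += r * r;
        b2 += b[i] * b[i];
    }
    return std::sqrt(r2 / b2) < tol;
}

No such one-shot verification exists once the whole trajectory runs on the GPU, since every intermediate force computation and field update would have to be recomputed on the host.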
During the editorial processing of this paper it was pointed out to us that the TWQCD Collaboration also uses GPUs to perform the complete Hybrid Monte Carlo update of QCD with optimal Domain Wall fermions (their first results were published in Ref. ).
The reported performance makes GPUs the preferred choice for medium-sized lattice groups that need substantial computational power at a convenient cost; in this sense they already represent a breakthrough for the lattice community. Our current lines of development concern the extension of our code to OpenCL and to multi-GPU architectures, for which preliminary results are reported in Section 5: this will open the possibility of using GPU clusters with fast interconnects (see for instance Ref. ) and of making GPU technology available also for large scale simulations.