In this paper we present and explore the performance of Landau gauge fixing on GPUs using CUDA. We consider the steepest descent algorithm with Fourier acceleration and compare the GPU performance with a parallel CPU implementation. Using 32⁴ lattice volumes, we find that the computational power of a single Tesla C2070 GPU is equivalent to approximately 256 CPU cores.
We have compared the performance of a CUDA implementation of the Fourier accelerated steepest descent Landau gauge fixing algorithm on GPUs with a standard MPI implementation built on the Chroma library. The test runs were done on 32⁴ pure gauge configurations at β = 5.8, 6.0 and 6.2, generated using the standard Wilson action.
In order to optimize the GPU code, its performance was investigated on 32³ × n gauge configurations. The runs on a Tesla C2070 show that, for a four-dimensional lattice, the best performance was achieved with 2D plus 2D FFTs using cufftPlanMany() and a 12 real parameter reconstruction with texture memory. Over all the runs on a Tesla C2070 GPU, the measured peak performance was 186 GFlops in single precision and 71 GFlops in double precision. A run on a single GPU delivers the same performance as the CPU code running on 32 nodes (256 cores), if one assumes a linear speed-up of the CPU code.
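The 12 real parameter reconstruction can be illustrated with a minimal sketch. It assumes the standard convention in which only the first two rows of each SU(3) link (6 complex numbers, i.e. 12 reals) are stored and the third row is rebuilt as the complex conjugate of their cross product. The function names and the test matrix below are hypothetical and written in plain Python for readability; they are not taken from the CUDA code discussed here.

```python
import cmath
import math


def reconstruct_third_row(row0, row1):
    """Rebuild the third row of an SU(3) matrix as conj(row0 x row1)."""
    a0, a1, a2 = row0
    b0, b1, b2 = row1
    return (
        (a1 * b2 - a2 * b1).conjugate(),
        (a2 * b0 - a0 * b2).conjugate(),
        (a0 * b1 - a1 * b0).conjugate(),
    )


def matmul3(a, b):
    """Multiply two 3x3 matrices stored as nested sequences."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def example_su3(alpha=0.3, beta=-1.1, theta=0.7, phi=0.4):
    """Hypothetical test matrix: diagonal phases times two real rotations."""
    d = [[cmath.exp(1j * alpha), 0, 0],
         [0, cmath.exp(1j * beta), 0],
         [0, 0, cmath.exp(-1j * (alpha + beta))]]   # unit determinant
    r01 = [[math.cos(theta), -math.sin(theta), 0],
           [math.sin(theta), math.cos(theta), 0],
           [0, 0, 1]]
    r12 = [[1, 0, 0],
           [0, math.cos(phi), -math.sin(phi)],
           [0, math.sin(phi), math.cos(phi)]]
    return matmul3(matmul3(d, r01), r12)


def reconstruction_error(u):
    """Largest deviation between the stored and the rebuilt third row."""
    rebuilt = reconstruct_third_row(u[0], u[1])
    return max(abs(x - y) for x, y in zip(rebuilt, u[2]))


if __name__ == "__main__":
    print(reconstruction_error(example_su3()))
```

Storing 12 instead of 18 reals per link trades a few extra floating point operations for reduced memory traffic, which is the usual motivation for this parametrization on bandwidth-limited hardware.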
As in the case of the generation of gauge configurations, the use of GPUs considerably reduces the simulation time. Presently, the main limitation of GPUs, both for gauge generation and for Landau gauge fixing, is their limited memory size. The Tesla C2070 has 6 GB of memory, which allows gauge fixing on a single GPU for lattice volumes up to 56⁴ using the 12 real number parametrization in double precision. Larger lattice volumes on GPU architectures require a multi-GPU implementation, which we leave for future work.
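The memory constraint can be made concrete with a rough estimate. The sketch below counts only the gauge links (four per site, 12 reals each, 8 bytes per real in double precision) and reads 6 GB as 6 × 2³⁰ bytes, which is an assumption. The additional work arrays needed by the gauge fixing and the FFTs are not specified here and are not included, so the numbers are lower bounds. The function name is hypothetical.

```python
def gauge_field_bytes(extent, dims=4, reals_per_link=12, bytes_per_real=8):
    """Bytes needed to store the links of an extent^dims lattice."""
    sites = extent ** dims
    links = sites * dims            # one link per site and direction
    return links * reals_per_link * bytes_per_real


if __name__ == "__main__":
    device_bytes = 6 * 2 ** 30      # assumed reading of "6 GB"
    for extent in (32, 56, 64):
        needed = gauge_field_bytes(extent)
        print(extent, needed, needed / device_bytes)
```

Under these assumptions the links of a 56⁴ lattice take about 3.5 GiB, which leaves room for auxiliary arrays, whereas those of a 64⁴ lattice alone would already fill the whole 6 GiB.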