We employ CUDA-capable graphics processing units (GPUs) for Coulomb, Landau, and maximally Abelian gauge fixing in 3+1 dimensional SU(3) lattice gauge field theories. The local overrelaxation algorithm is well suited to highly parallel architectures. Simulated annealing preconditioning strongly increases the probability of reaching the global maximum of the gauge functional. We give performance results for single and double precision. Reaching our maximum performance of ~300 GFlops on NVIDIA’s GTX 580 requires a very fine-grained degree of parallelism because of the register limits of NVIDIA’s Fermi GPUs: we use eight threads per lattice site, i.e., one thread per SU(3) matrix involved in the computation of a site update.
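To make the eight-threads-per-site decomposition concrete: in four dimensions a site touches four forward links U_mu(x) and four backward links U_mu(x - mu), which gives eight SU(3) matrices per site update. The following host-side C++ sketch shows one way a global thread index could be mapped to a (site, direction, orientation) triple. The lane ordering, the x-fastest site indexing, and all names (`LinkTask`, `linkTaskForThread`, `shiftedSite`, `owningSite`, `Lattice`) are illustrative assumptions, not taken from our implementation.

```cpp
#include <cassert>

// Assumed layout: 8 consecutive threads cooperate on one lattice site.
constexpr int kDims = 4;                  // 3+1 dimensions
constexpr int kLinksPerSite = 2 * kDims;  // 4 forward + 4 backward links

struct Lattice {
    int extent[kDims];  // lattice extent in each direction (periodic)
};

struct LinkTask {
    int site;       // linear index of the site being updated
    int mu;         // link direction, 0..3
    bool backward;  // false: U_mu(x); true: U_mu(x - mu)
};

// Map a global thread index to the link that thread would handle.
// In a kernel, globalThread would be blockIdx.x * blockDim.x + threadIdx.x.
inline LinkTask linkTaskForThread(int globalThread) {
    assert(globalThread >= 0);
    const int lane = globalThread % kLinksPerSite;  // 0..7 within the site
    return LinkTask{globalThread / kLinksPerSite, lane % kDims, lane >= kDims};
}

// Linear index of the neighbour of `site` shifted by step = +1 or -1
// in direction mu, with periodic boundaries (coordinate 0 runs fastest).
inline int shiftedSite(const Lattice& lat, int site, int mu, int step) {
    int coord[kDims];
    int rest = site;
    for (int d = 0; d < kDims; ++d) {
        coord[d] = rest % lat.extent[d];
        rest /= lat.extent[d];
    }
    coord[mu] = (coord[mu] + step + lat.extent[mu]) % lat.extent[mu];
    int idx = 0;
    for (int d = kDims - 1; d >= 0; --d) idx = idx * lat.extent[d] + coord[d];
    return idx;
}

// Site under which the link of this task is stored: a backward link
// U_mu(x - mu) belongs to the neighbouring site x - mu.
inline int owningSite(const Lattice& lat, const LinkTask& task) {
    return task.backward ? shiftedSite(lat, task.site, task.mu, -1) : task.site;
}
```

With this assignment each thread keeps only a single 3x3 complex matrix in registers, which is the motivation for the fine granularity stated above. The eight partial results for a site would then have to be combined, for example through shared memory; that step is not shown here.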
We have presented a CUDA implementation of lattice gauge fixing based on relaxation algorithms. In particular, our code can fix gauge field configurations to Landau, Coulomb, or maximally Abelian gauge using simulated annealing, overrelaxation, or stochastic relaxation. Using a fine parallelization granularity of eight CUDA threads per lattice site, we achieve a maximum performance of 300 GFlops in single precision on NVIDIA’s GTX 580. Compared with the overrelaxation algorithm as implemented in the FermiQCD library, run in parallel using MPI on an Intel Core i7-950 (“Bloomfield”) quad-core processor at 3.07 GHz, this corresponds to a speedup of more than a factor of 150. Our code will be available for download shortly.
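For readers unfamiliar with the relaxation family of algorithms summarized above, the following is a minimal sketch of one checkerboard overrelaxation sweep for Landau gauge. To stay short it deliberately replaces SU(3) in 3+1 dimensions with a toy compact U(1) field on a two-dimensional periodic lattice, so that link variables and gauge transformations are plain complex phases. The names (`ToyField`, `gaugeFunctional`, `overrelaxSweep`) and the data layout are assumptions made for illustration only and do not reflect our CUDA code.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <vector>

using cplx = std::complex<double>;

// Toy model (assumption): compact U(1) links on an L x L periodic lattice.
struct ToyField {
    int L;                   // lattice extent; should be even (see below)
    std::vector<cplx> link;  // link[2 * site + mu], each of unit modulus
    int site(int x, int y) const { return ((y + L) % L) * L + (x + L) % L; }
    cplx& U(int x, int y, int mu) { return link[2 * site(x, y) + mu]; }
};

// Landau-gauge functional: sum of Re U over all links, to be maximized.
double gaugeFunctional(const ToyField& f) {
    double sum = 0.0;
    for (const cplx& u : f.link) sum += u.real();
    return sum;
}

// One overrelaxation sweep with parameter 1 <= omega < 2.
// Sites of equal parity share no links when L is even, so on a parallel
// machine every site of one parity could be updated simultaneously.
void overrelaxSweep(ToyField& f, double omega) {
    assert(omega >= 1.0 && omega < 2.0);
    for (int parity = 0; parity < 2; ++parity) {
        for (int y = 0; y < f.L; ++y) {
            for (int x = 0; x < f.L; ++x) {
                if ((x + y) % 2 != parity) continue;
                // K(x) = sum_mu [ U_mu(x) + conj(U_mu(x - mu)) ]
                const cplx K = f.U(x, y, 0) + f.U(x, y, 1)
                             + std::conj(f.U(x - 1, y, 0))
                             + std::conj(f.U(x, y - 1, 1));
                if (std::abs(K) == 0.0) continue;  // nothing to maximize
                // Plain relaxation uses g = conj(K) / |K|, the maximizer of
                // Re[g K]; overrelaxation applies g^omega instead.
                const cplx g = std::polar(1.0, -omega * std::arg(K));
                // Gauge-transform the four links attached to this site.
                f.U(x, y, 0) *= g;
                f.U(x, y, 1) *= g;
                f.U(x - 1, y, 0) *= std::conj(g);
                f.U(x, y - 1, 1) *= std::conj(g);
            }
        }
    }
}
```

In this toy model each local update cannot decrease the functional for 1 <= omega < 2, so repeated sweeps drive it towards a (possibly local) maximum. For SU(3) the local maximization involves matrix-valued gauge transformations and is correspondingly more involved; it is not shown here.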