Graphics processing units (GPUs) are increasingly being used for general computational purposes. This development is motivated by their theoretical peak performance, which significantly exceeds that of broadly available CPUs. For practical purposes, however, it is far from clear how much of this theoretical performance can be realized in actual scientific applications. As is discussed here for the case of studying classical spin models of statistical mechanics by Monte Carlo simulations, only an explicit tailoring of the involved algorithms to the specific architecture under consideration makes it possible to harvest the computational power of GPU systems. A number of examples are discussed, ranging from Metropolis simulations of ferromagnetic Ising models, through continuous Heisenberg and disordered spin-glass systems, to parallel-tempering simulations. Significant speed-ups by factors of up to 1000 compared to serial CPU code, as well as to previous GPU implementations, are observed.
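To make the kind of tailoring referred to above more concrete, the following is a minimal sketch (not the paper's actual code) of a Metropolis update for the 2D ferromagnetic Ising model using the standard checkerboard decomposition, so that all sites updated in one kernel call are mutually non-interacting. The kernel name, the storage of spins as signed char values of +/-1, and the use of one cuRAND state per site are illustrative assumptions.

// Illustrative sketch: one Metropolis pass over a single checkerboard
// sublattice of a 2D Ising ferromagnet (coupling J = 1) with periodic
// boundaries. Spins live in a flat L*L array; "parity" selects which
// sublattice is updated so that parallel updates do not interfere.
#include <cuda_runtime.h>
#include <curand_kernel.h>

__global__ void metropolis_sublattice(signed char* spin, int L, float beta,
                                      int parity, curandState* rng)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= L || y >= L) return;
    if (((x + y) & 1) != parity) return;   // only touch one sublattice per call

    int idx = y * L + x;
    // nearest neighbours with periodic boundary conditions
    int xp = (x + 1) % L, xm = (x + L - 1) % L;
    int yp = (y + 1) % L, ym = (y + L - 1) % L;
    int nbsum = spin[y * L + xp] + spin[y * L + xm]
              + spin[yp * L + x] + spin[ym * L + x];

    // energy change for flipping spin[idx]; accept with Metropolis probability
    float dE = 2.0f * spin[idx] * nbsum;
    curandState local = rng[idx];
    if (dE <= 0.0f || curand_uniform(&local) < __expf(-beta * dE))
        spin[idx] = -spin[idx];
    rng[idx] = local;
}

Note that this naive version leaves half of the launched threads idle in each call; a tuned implementation of the kind discussed in the paper would instead map threads only to the sites of one sublattice and exploit the memory hierarchy (shared or texture memory) for the neighbour loads.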
Current GPUs have significant potential for speeding up scientific calculations compared to the more traditional CPU architecture. In particular, this applies to the simulation of systems with local interactions and local updates, where the massive parallelism inherent in the GPU design can be exploited efficiently through appropriate domain decompositions. The simulation of lattice spin systems appears to be a paradigmatic example of this class, as the decomposition remains static and thus no significant communication overhead is incurred. Achieving the substantial speed-ups reported for the case studies here, often exceeding two orders of magnitude, requires a careful tailoring of implementations to the peculiarities of the architecture under consideration. In particular, attention must be paid to the hierarchical organization of memories (including more exotic ones such as texture memory), the avoidance of branch divergence, and the choice of thread and block numbers commensurate with the capabilities of the cards employed. For good performance, it is crucial to understand how these devices hide the significant latency of accesses to off-die memories through the interleaved scheduling of a number of execution threads significantly exceeding the number of available computing cores. It is encouraging that, with the rather moderate coding effort mediated by high-level language extensions such as NVIDIA CUDA or OpenCL, such devices can reach updating times in the same ballpark as those of special-purpose machines such as Janus, which required a vastly higher development effort.
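The point about latency hiding through oversubscription of the cores can be illustrated with a host-side sketch for the kernel above (again not the paper's code): one full lattice sweep is composed of two kernel launches, one per sublattice, and the grid is chosen so that far more threads are resident than there are cores. The 16x16 block size is merely a typical choice, not a value tuned or quoted in the paper.

// Illustrative host-side driver: a full sweep consists of two launches, one
// per checkerboard sublattice. Launching many more threads than there are
// cores lets the warp scheduler interleave execution and hide the latency of
// off-chip (global) memory accesses. d_spin and d_rng are device arrays of
// L*L spins and initialized curandState objects, respectively.
void sweep(signed char* d_spin, int L, float beta, curandState* d_rng)
{
    dim3 block(16, 16);                      // 256 threads per block (typical choice)
    dim3 grid((L + block.x - 1) / block.x,
              (L + block.y - 1) / block.y);
    metropolis_sublattice<<<grid, block>>>(d_spin, L, beta, 0, d_rng);  // "even" sites
    metropolis_sublattice<<<grid, block>>>(d_spin, L, beta, 1, d_rng);  // "odd" sites
}

How well such a launch configuration performs depends on the occupancy limits of the specific card, which is exactly the kind of architecture-dependent tuning emphasized above.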
Martin Weigel, Performance potential for simulating spin models on GPU, Journal of Computational Physics, available online 16 December 2011, DOI: 10.1016/j.jcp.2011.12.008. Preprint: arXiv:1101.1427v1 [physics.comp-ph].