We have recently carried out a computational campaign to investigate a model of coronal heating in three-dimensions using reduced magnetohydrodynamics (RMHD). Our code is built on a conventional scheme using the pseudo-spectral method, and is parallelized using MPI. The current investigation requires very long time integrations using high Lundquist numbers, where the formation of very fine current layers challenge the resolutions achievable even on massively parallel machines. We present here results of a port to Nvidia CUDA (Compute Unified Device Architecture) for hardware acceleration using graphics processing units (GPUs). In addition to a brief discussion of our general strategy, we will report code performance on several machines which span a variety of hardware configurations and capabilities. These include a desktop workstation with commodity hardware, a dedicated research workstation equipped with four Nvidia C2050 GPUs, as well as several large-scale GPU accelerated distributed memory machines: Lincoln/NCSA, Dirac/NERSC, and Keeneland/NICS.
Our strategy for a comprehensive reprogramming of RMCT for GPU acceleration with CUDA can be summarized as follows:
 Perform FFTs using the CUFFT library. The library is about an order of magnitude faster than our original FFT implementation when not considering CPU-GPU memory transfers but only several times faster when including them. We aim therefore to maximize the number of FFTs per memory transfer, and to perform intermediary tasks on the GPU.
 Recycle memory of intermediate quantities. There is limited memory available on the GPU board so one must adequately budget memory allocated on board in such a manner as to observe.
 Write simple kernels for point-wise arithmetic.
 Preserve the underlying MPI decomposition. We are dealing here with two levels of parallelization, the first being the domain decomposition in z and the second being the massive parallelism afforded by GPUs.