Comparison of GPU architectures for asynchronous communication with finite-differencing applications
Graphics processing units (GPUs) are effective data-parallel accelerators for solving partial differential equations (PDEs) on regular meshes, where low-latency communication and high compute-to-communication ratios can yield very high computational efficiency. Finite-difference time-domain methods still play an important role in many PDE applications. Iterative multi-grid and multilevel algorithms can converge faster than ordinary finite-difference methods but can be much more difficult to parallelize under GPU memory constraints. We report on some practical algorithmic and data layout approaches and present performance data for a range of GPUs programmed with CUDA. We focus on the use of multiple GPU devices with a single CPU host and on the asynchronous CPU/GPU communication issues involved. We obtain more than two orders of magnitude of speedup over a comparable CPU core.
Conclusions
We have investigated the potential of a single CPU host for optimally managing multiple GPU accelerator devices. This approach is likely to be of great importance in designing future computer clusters. We have focused on finite-difference time-domain applications as our representative benchmark problem and have extended our previous work to now use three-device configurations and also to use NVIDIA’s currently available device with the most cores—the GTX480.
We have presented two methods for implementing finite-differencing field-equation simulations on multiple GPUs. All of the implementations for which we have presented performance data operate on a single host machine containing two or three GPU devices. We have shown that the correct use of asynchronous memory communication can reliably provide linear speedup over a single-GPU implementation, and that on NVIDIA’s recent Fermi-architecture GPUs the speedup can sometimes even be super-linear because these devices now have cache memory. We have also found that, with a carefully configured system, synchronous communication can provide comparable performance, but it is very sensitive to any thread synchronization delays.
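The ordering that asynchronous communication relies on can be illustrated with a small host-side model. The sketch below is not the CUDA implementation reported in the paper; it is a minimal Python analogue under stated assumptions: a one-dimensional periodic diffusion stencil stands in for the field equation, each `Device` object stands in for one GPU, and a thread pool stands in for an asynchronous copy to host memory. The names `ALPHA`, `Device`, `stage_on_host`, `step_decomposed`, `run_decomposed` and `run_reference` are all hypothetical. Each step updates the border cells first, starts their transfer without waiting, updates the interior, and only then waits for the transfer and refreshes the neighbours' halo cells.

```python
from concurrent.futures import ThreadPoolExecutor

ALPHA = 0.1  # hypothetical value of D*dt/dx^2 for the illustrative stencil


def update_cell(left, centre, right):
    """Explicit finite-difference update of one cell."""
    return centre + ALPHA * (left - 2.0 * centre + right)


class Device:
    """Stand-in for one GPU: a chunk of the field plus two halo cells."""

    def __init__(self, cells, left_halo, right_halo):
        self.cells = list(cells)
        self.left_halo = left_halo
        self.right_halo = right_halo


def make_devices(u, p):
    """Split u into p equal chunks (assumes len(u) is divisible by p)."""
    n, m = len(u), len(u) // p
    return [Device(u[d * m:(d + 1) * m], u[(d * m - 1) % n], u[((d + 1) * m) % n])
            for d in range(p)]


def stage_on_host(first, last):
    """Models copying a device's two new border cells into host memory."""
    return first, last


def step_decomposed(devices, pool):
    new_cells, pending = [], []
    for dev in devices:
        c, m = dev.cells, len(dev.cells)
        # 1. Border cells first: they are the only cells that need the halos.
        first = update_cell(dev.left_halo, c[0], c[1])
        last = update_cell(c[m - 2], c[m - 1], dev.right_halo)
        # 2. Start the border transfer and continue without waiting for it.
        pending.append(pool.submit(stage_on_host, first, last))
        # 3. Interior cells are updated while the transfer is in flight.
        interior = [update_cell(c[i - 1], c[i], c[i + 1]) for i in range(1, m - 1)]
        new_cells.append([first] + interior + [last])
    # 4. Wait for all transfers, then hand each border to the neighbouring device.
    host = [f.result() for f in pending]
    p = len(devices)
    for d, dev in enumerate(devices):
        dev.cells = new_cells[d]
        dev.left_halo = host[(d - 1) % p][1]
        dev.right_halo = host[(d + 1) % p][0]


def run_decomposed(u, p, steps):
    devices = make_devices(u, p)
    with ThreadPoolExecutor(max_workers=p) as pool:
        for _ in range(steps):
            step_decomposed(devices, pool)
    return [x for dev in devices for x in dev.cells]


def run_reference(u, steps):
    """Single-domain version of the same stencil, used as a correctness check."""
    n = len(u)
    for _ in range(steps):
        u = [update_cell(u[(i - 1) % n], u[i], u[(i + 1) % n]) for i in range(n)]
    return u
```

Calling `run_decomposed(u, 3, 5)` on a 24-cell field gives the same values as `run_reference(u, 5)`, which confirms that the halo exchange is correct. The model shows only the ordering of work; it makes no claim about timing, since the overlap that matters in practice happens between GPU kernels and device-to-host copies.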
While GPGPU has opened the door to cheap and powerful data-parallel architectures, these devices do not necessarily have the same scalability as cluster machines. However, good scalability can be achieved by extending a single-host system to a many-GPU cluster, in which each node is itself a highly data-parallel machine containing one or more GPU devices.
The system described in this work performs communication between devices through CPU memory, so the communication latency is relatively low. In a cluster of multiple GPU-accelerated nodes, where communication between nodes is slower, proper use of asynchronous communication becomes even more vital to fully utilize computing resources.
For the work we report in this paper, each GPU device has around 1–1.5 GBytes of memory. These ‘gamer-level’ devices are considerably cheaper than the ‘professional-level’ devices available, such as the C2070, which has 6 GBytes of memory. It remains an interesting tradeoff to decide whether to configure a compute cluster with more, cheaper cards or with fewer, well-configured ones. We plan to experiment with more GPUs within a single hosted machine and also to examine approaches for decomposing our simulations across clusters with non-homogeneous, GPU-accelerated nodes. This will include possible methods for pipelining communications in such systems to improve compute/communication ratios.
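The tradeoff between many small-memory cards and few large-memory cards can be explored with some simple arithmetic. The sketch below rests on assumptions that are not taken from the paper: a cubic grid of n³ cells cut into equal slabs along one axis, one halo layer exchanged with each of two neighbours, two stored copies of the field (read and write buffers), and 4-byte cell values. The function name `slab_stats` and its parameters are hypothetical.

```python
def slab_stats(n, devices, fields=2, bytes_per_cell=4):
    """Per-device figures for an n^3 grid cut into equal slabs along one axis.

    Assumes n is divisible by the number of devices and that each device
    exchanges one n*n layer of cells with each of its two neighbours.
    """
    slab_cells = n * n * (n // devices)   # cells updated by one device per step
    halo_cells = 2 * n * n                # cells communicated per step
    memory_bytes = fields * bytes_per_cell * (slab_cells + halo_cells)
    return {
        "cells": slab_cells,
        "halo": halo_cells,
        "ratio": slab_cells / halo_cells,      # compute/communication ratio
        "memory_gib": memory_bytes / 2**30,    # device memory needed, in GiB
    }
```

Under these assumptions, `slab_stats(512, 2)` gives a compute/communication ratio of 128 and about 0.5 GiB of memory per device, while `slab_stats(512, 4)` halves the ratio to 64. Adding devices therefore reduces the memory needed per card but also reduces the amount of computation available to hide each halo exchange.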
In summary, multiple CUDA-programmed GPUs offer an attractive route for building future-generation cluster nodes. This is particularly worthwhile if the application programmer can exploit the more esoteric CUDA features, in which case, as we have shown, speedups of two orders of magnitude over a current-generation single CPU core are possible.
D. P. Playne and K. A. Hawick. Comparison of GPU architectures for asynchronous communication with finite-differencing applications. Concurrency and Computation: Practice and Experience. Volume 24, Issue 1, pages 73–83, January 2012. [doi: 10.1002/cpe.1726]