Starting from the single graphics processing unit (GPU) version of the Smoothed Particle Hydrodynamics (SPH) code DualSPHysics, a multi-GPU SPH program is developed for free-surface flows. The approach is based on a spatial decomposition technique, whereby different portions (sub-domains) of the physical system under study are assigned to different GPUs. Communication between devices is achieved with Message Passing Interface (MPI) application programming interface (API) routines. The use of the radix sort algorithm for inter-GPU particle migration and for building the sub-domain “halos” (which enable interaction between SPH particles of different sub-domains) is described in detail. With the resulting scheme it is possible, on the one hand, to carry out simulations that fit on a single GPU, but faster than on one such device alone. On the other hand, accelerated simulations with up to 32 million particles can be performed on the current architecture, beyond what a single GPU can accommodate because of memory constraints. A study of the weak and strong scaling behavior, speedups and efficiency of the resulting program is presented, including an investigation of the computational bottlenecks. Finally, possibilities for reducing the effects of overhead on computational efficiency in future versions of the scheme are discussed.
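As a rough illustration of the sorting idea (a minimal sketch only, not the code described in the paper), a radix-type key sort such as Thrust's sort_by_key can reorder particles by a destination key so that particles leaving a sub-domain, or exported as halo data, become contiguous blocks ready to be packed into MPI messages. The key encoding (0 = stays, 1 = left neighbour, 2 = right neighbour) and the Particle layout below are assumptions made for this example.

    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <thrust/count.h>

    struct Particle { float x, y, z, vx, vy, vz; };

    int main() {
        const int n = 6;
        // Destination key per particle: 0 = stays in this sub-domain,
        // 1 = migrates to the left neighbour, 2 = migrates to the right neighbour.
        int host_key[n] = {2, 0, 1, 0, 2, 0};
        thrust::device_vector<int>      key(host_key, host_key + n);
        thrust::device_vector<Particle> part(n);   // positions/velocities omitted in this sketch

        // Thrust typically dispatches a radix sort for primitive integer keys; after
        // the sort, the particles bound for each destination form contiguous blocks.
        thrust::sort_by_key(key.begin(), key.end(), part.begin());

        // Block sizes tell us how many particles to pack into each MPI message.
        int n_left  = thrust::count(key.begin(), key.end(), 1);
        int n_right = thrust::count(key.begin(), key.end(), 2);
        (void)n_left; (void)n_right;
        return 0;
    }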
This paper presented a computational methodology for carrying out three-dimensional, massively parallel Smoothed Particle Hydrodynamics (SPH) simulations across multiple GPUs. This has been achieved by introducing a spatial domain decomposition scheme into the single-GPU version of the DualSPHysics code, converting it into a multi-GPU SPH program.
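A minimal sketch of such a decomposition, assuming a one-dimensional splitting of the domain into equal slabs along the x axis with one GPU per MPI rank (the paper's actual partitioning may differ), could look like the following; the extents x_min and x_max are placeholders.

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank = 0, nranks = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        // One GPU per rank (assumes as many visible devices per node as ranks per node).
        int ndev = 1;
        cudaGetDeviceCount(&ndev);
        cudaSetDevice(rank % ndev);

        // Split the global x extent [x_min, x_max) into equal slabs, one per rank.
        const float x_min = 0.0f, x_max = 1.0f;        // placeholder domain limits
        const float dx      = (x_max - x_min) / nranks;
        const float slab_lo = x_min + rank * dx;
        const float slab_hi = slab_lo + dx;
        // A particle with position x belongs to this rank if slab_lo <= x < slab_hi;
        // particles within one smoothing length of a slab boundary would additionally
        // be exported to the neighbouring rank as halo data.
        (void)slab_lo; (void)slab_hi;

        MPI_Finalize();
        return 0;
    }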
By using multiple GPUs, on the one hand, simulations that fit on a single GPU can now be run even faster than on one such device alone, leading to speedups of several hundred compared with a single-threaded CPU program. On the other hand, accelerated simulations with tens of millions of particles can now be performed, which would be impossible on a single GPU due to memory constraints in a fully on-GPU simulation where all data resides on the device. By simulating this large number of particles at speeds well beyond one hundred times faster than single-CPU programs, without the need for large, expensive clusters of CPUs, our software has the potential to bypass the limitations on system size and simulation time that have constrained the applicability of SPH to various engineering problems.
The methodology features the use of MPI routines and of the radix sort algorithm for the migration of particles between GPUs as well as for the building of the sub-domain “halos”. A study of weak and strong scaling with a slow Ethernet connection shows that inter-CPU communications are likely to be the bottleneck of our simulations, but considerable overhead is also produced by data preparation and, to a lesser extent, by GPU-CPU data transfers. A possible remedy for the latter is the use of pinned memory, which so far remains unused in our program. The use of InfiniBand instead of Ethernet should reduce the overhead caused by inter-CPU communications, and for GPUs residing on the same host the recently released CUDA 4.0 will be introduced, which should further accelerate communications. Future work also includes the introduction of a dynamic load balancing algorithm, a multi-dimensional domain decomposition scheme, and floating body capabilities.
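The pinned-memory remedy mentioned above could, for instance, be applied as in the following sketch (hypothetical, not the authors' implementation): page-locked host buffers obtained with cudaHostAlloc allow faster, asynchronous device-to-host copies of the migration/halo buffers before they are handed to MPI. Buffer size and names are illustrative only.

    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        const size_t n = 1 << 20;                       // illustrative halo-buffer size (floats)
        float* d_buf    = nullptr;                      // device-side particle/halo data
        float* h_pinned = nullptr;                      // page-locked (pinned) host buffer

        cudaMalloc((void**)&d_buf, n * sizeof(float));
        cudaHostAlloc((void**)&h_pinned, n * sizeof(float), cudaHostAllocDefault);

        cudaStream_t stream;
        cudaStreamCreate(&stream);

        // With pinned memory this copy can run asynchronously and at higher bandwidth,
        // potentially overlapping with kernels or with the MPI exchange of a previous step.
        cudaMemcpyAsync(h_pinned, d_buf, n * sizeof(float),
                        cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream);
        printf("copied %zu bytes via a pinned host buffer\n", n * sizeof(float));

        cudaStreamDestroy(stream);
        cudaFreeHost(h_pinned);
        cudaFree(d_buf);
        return 0;
    }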
Daniel Valdez-Balderas, José M. Domínguez, Benedict D. Rogers and Alejandro J.C. Crespo. Towards accelerating smoothed particle hydrodynamics simulations for free-surface flows on multi-GPU clusters. Journal of Parallel and Distributed Computing, 2012. [doi: 10.1016/j.jpdc.2012.07.010]