The use of accelerators such as graphics processing units (GPUs) has become popular in scientific computing applications due to their low cost, impressive floating-point capabilities, high memory bandwidth, and low electrical power requirements. Hybrid high-performance computers, machines with nodes containing more than one type of floating-point processor (e.g., CPU and GPU), are now becoming more prevalent due to these advantages. In this paper, we present a continuation of earlier work on implementing accelerator algorithms in the LAMMPS molecular dynamics software for distributed-memory parallel hybrid machines. In our previous work, we focused on acceleration of short-range models with an approach intended to harness the processing power of both the accelerator and the (multi-core) CPUs. To augment the existing implementations, we present an efficient implementation of long-range electrostatic force calculation for molecular dynamics. Specifically, we present an implementation of the particle–particle particle-mesh (P3M) method based on the work by Harvey and De Fabritiis. We present benchmark results on the Keeneland InfiniBand GPU cluster, provide a performance comparison of the same kernels compiled with both CUDA and OpenCL, and discuss limitations to parallel efficiency as well as future directions for improving performance on hybrid or heterogeneous computers.
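For context, P3M evaluates the Coulomb sum by splitting it into a short-range part computed directly between particle pairs and a smooth long-range part solved on a mesh. In a generic Ewald-style splitting (Gaussian units; this notation is standard background and is not taken from the paper):

    U = \frac{1}{2}\sum_{i\neq j} q_i q_j\,\frac{\operatorname{erfc}(\alpha r_{ij})}{r_{ij}}
        \;+\; \frac{1}{2V}\sum_{\mathbf{k}\neq 0} \frac{4\pi}{k^2}\, e^{-k^2/4\alpha^2}\,
        \bigl|\tilde{\rho}(\mathbf{k})\bigr|^2
        \;-\; \frac{\alpha}{\sqrt{\pi}}\sum_i q_i^2,
    \qquad \tilde{\rho}(\mathbf{k}) = \sum_j q_j\, e^{i\mathbf{k}\cdot\mathbf{r}_j}.

The first term is the short-range ("particle–particle") contribution handled by the pairwise kernels; the second is the long-range ("particle-mesh") contribution whose charge assignment, field solve, and force interpolation stages appear in the figures below.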
Percentage of loop time spent on short-range non-bonded force calculation (Short Range); bond, angle, dihedral, and improper forces (Bond); neighbor list builds (Neigh); MPI communication excluding Poisson solves (Comm); P3M charge assignment (Charge Assign); P3M field solves (Field solve); P3M force interpolation (Force Interp); and time integration, statistics, and other calculations (Other). Percentages are taken from a strong scaling benchmark (1 to 32 cores) with the 32 000 atom rhodopsin simulation.
Comparison of short-range pairwise force calculation times on the accelerator (single precision) using 1 or 8 threads per atom with computation times using only the CPU. Timings represent strong scaling results using the rhodopsin benchmark. Only 1 CPU core is used per GPU for these timings.
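The "1 or 8 threads per atom" in this comparison refers to how many device threads cooperate on a single atom's neighbor list. Below is a minimal CUDA sketch of that idea, assuming a dense neighbor list with a fixed stride and a bare Lennard-Jones pair force (no energies, virials, or exclusion handling); the kernel and variable names are illustrative, not the paper's.

    #define TPA 8          // threads cooperating per atom (1 or 8 in the figure)
    #define MAX_NEIGH 128  // assumed fixed neighbor-list stride

    __global__ void lj_force_tpa(const float4 *pos, const int *neigh,
                                 const int *numneigh, int nlocal,
                                 float cutsq, float4 *force) {
      int tid  = blockIdx.x * blockDim.x + threadIdx.x;
      int i    = tid / TPA;   // atom handled by this group of threads
      int lane = tid % TPA;   // index of this thread within the group
      float fx = 0.f, fy = 0.f, fz = 0.f;

      if (i < nlocal) {
        float4 pi = pos[i];
        int n = numneigh[i];
        // The TPA threads stride through this atom's neighbor list.
        for (int jj = lane; jj < n; jj += TPA) {
          int j = neigh[i * MAX_NEIGH + jj];
          float4 pj = pos[j];
          float dx = pi.x - pj.x, dy = pi.y - pj.y, dz = pi.z - pj.z;
          float rsq = dx * dx + dy * dy + dz * dz;
          if (rsq < cutsq) {
            float r2inv = 1.f / rsq;
            float r6inv = r2inv * r2inv * r2inv;
            // Lennard-Jones pair force / r with epsilon = sigma = 1
            float fpair = r6inv * (48.f * r6inv - 24.f) * r2inv;
            fx += dx * fpair; fy += dy * fpair; fz += dz * fpair;
          }
        }
      }

      // Combine the TPA partial sums (TPA is assumed to divide the warp size).
      for (int off = TPA / 2; off > 0; off >>= 1) {
        fx += __shfl_down_sync(0xffffffffu, fx, off, TPA);
        fy += __shfl_down_sync(0xffffffffu, fy, off, TPA);
        fz += __shfl_down_sync(0xffffffffu, fz, off, TPA);
      }
      if (lane == 0 && i < nlocal) force[i] = make_float4(fx, fy, fz, 0.f);
    }

With TPA = 1 the reduction loop is skipped and each thread handles one atom alone; larger TPA values trade per-thread work for better memory coalescing on long neighbor lists.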
Speedups for the charge assignment routines (upper) and force interpolation (lower) with GPU acceleration. Speedups are shown for order 4, 5, and 6 splines in single (SP) and double (DP) precision. The numbers are from strong scaling timings using the 32 000 particle rhodopsin benchmark; comparisons use 1 CPU core per GPU. Host-device data transfer is included in the timings. Speedup for the charge assignment includes the mesh initialization, particle mapping, and charge assignment kernels.
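Charge assignment spreads each particle's charge onto nearby mesh points using spline weights of the chosen order. The CUDA sketch below uses a much simpler order-2 (cloud-in-cell) scatter with atomic adds; it is not the collision-avoiding assignment scheme used in this work, and all names are illustrative.

    // Each thread spreads one particle's charge onto its 8 surrounding mesh
    // points.  Particles are assumed to lie inside the local domain [lo, hi).
    __global__ void assign_charge_cic(const float4 *pos_q,   // x, y, z, charge
                                      int n, float3 lo, float3 h_inv,
                                      int nx, int ny, int nz, float *rho) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i >= n) return;
      float4 p = pos_q[i];

      // Fractional mesh coordinates of the particle.
      float gx = (p.x - lo.x) * h_inv.x;
      float gy = (p.y - lo.y) * h_inv.y;
      float gz = (p.z - lo.z) * h_inv.z;
      int ix = (int)floorf(gx), iy = (int)floorf(gy), iz = (int)floorf(gz);
      float fx = gx - ix, fy = gy - iy, fz = gz - iz;

      // Scatter the charge with cloud-in-cell weights (periodic wrap).
      for (int dz = 0; dz < 2; dz++)
        for (int dy = 0; dy < 2; dy++)
          for (int dx = 0; dx < 2; dx++) {
            float w = (dx ? fx : 1.f - fx) * (dy ? fy : 1.f - fy)
                                           * (dz ? fz : 1.f - fz);
            int mx = (ix + dx) % nx, my = (iy + dy) % ny, mz = (iz + dz) % nz;
            atomicAdd(&rho[(mz * ny + my) * nx + mx], w * p.w);
          }
    }

An order-p spline assignment touches p^3 mesh points per particle instead of 8, which is why the higher-order results above are more expensive per particle but tolerate a coarser mesh.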
Strong scaling timings for long-range force calculations representing the sum of the execution times for charge assignment and force interpolation on the GPU and field solves performed on the CPU. Timings are for order 4 (30 × 40 × 36 mesh), 5 (25 × 32 × 32 mesh), and 6 (24 × 32 × 30 mesh) splines using single (SP) and double (DP) precision GPU calculations. The timings for CPU-only long-range calculations with an order 5 spline are shown for reference. Comparisons use 1 CPU core per GPU.
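For context, the field solve is the reciprocal-space stage of P3M: the mesh charge density is Fourier transformed, multiplied by an influence function, transformed back, and the resulting field is interpolated to the particles with the same spline weights. Schematically (generic P3M notation, not taken from the paper):

    \hat{\phi}(\mathbf{k}) = \hat{G}(\mathbf{k})\,\hat{\rho}_{\mathrm{mesh}}(\mathbf{k}), \qquad
    \mathbf{E} = -\nabla\phi, \qquad
    \mathbf{F}_i = q_i \sum_{\mathbf{m}} W_p(\mathbf{r}_i - \mathbf{r}_{\mathbf{m}})\,
    \mathbf{E}(\mathbf{r}_{\mathbf{m}}),

where \hat{G} is the influence function and W_p is the order-p assignment (spline) weight. Higher-order splines permit a coarser mesh for comparable accuracy, which is why the order 5 and 6 runs above use smaller meshes than the order 4 run.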
Percentage of GPU P3M time required for mapping particles to mesh points (Map), performing charge assignment (Assign), performing force interpolation (Interp), and host-device data transfer. Percentages are taken from a strong scaling benchmark with the 32 000 atom rhodopsin simulation using order 5 splines for P3M.
Performance comparison of CUDA and OpenCL for the same kernels using the rhodopsin benchmark. Timings are for GPU kernels and host-device transfer only.
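A common way to run the same kernel source under both CUDA and OpenCL is a thin macro layer over the keyword differences. The snippet below only illustrates that technique; the macro names are not those of the portability layer used in this work.

    #ifdef __OPENCL_VERSION__
      #define KERNEL      __kernel
      #define GLOBAL      __global
      #define GLOBAL_ID_X get_global_id(0)
    #else  /* CUDA */
      #define KERNEL      extern "C" __global__
      #define GLOBAL
      #define GLOBAL_ID_X (blockIdx.x * blockDim.x + threadIdx.x)
    #endif

    // One kernel body, compiled unchanged as CUDA or OpenCL C.
    KERNEL void scale_forces(GLOBAL float *f, const float s, const int n) {
      int i = GLOBAL_ID_X;
      if (i < n) f[i] *= s;
    }

With this approach, any performance difference between the two builds comes from the toolchains and runtimes rather than from divergent kernel code.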
Strong scaling timings for the entire simulation loop run entirely on the CPU (CPU) or with GPU acceleration in single, mixed, or double precision. Timings are for the rhodopsin benchmark. In the mixed case, mixed precision is used for the short-range force calculation and double precision for the long-range force calculation. CPU-only timings were obtained using only 1 CPU core per socket. Single and mixed precision timings are very similar.
Breakdown of the mixed precision loop times in Fig. 7 into time spent in various routines on the CPU (top) and on the GPU (bottom). In both cases, percentages represent the fraction of the wall time required to complete the simulation loop. "Idle" timings represent the percentage of time that the CPU or accelerator is not performing computations while waiting for data. Certain host-device data transfers, short-range pairwise force calculations, and charge assignment can be performed concurrently with bond, angle, dihedral, and improper force calculations.
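The concurrency noted above relies on launching device work asynchronously so that the host can evaluate bonded terms while the accelerator computes. A minimal host-side CUDA sketch follows; all function and variable names are illustrative rather than the paper's API.

    #include <cuda_runtime.h>

    __global__ void short_range_kernel(const float4 *pos, float4 *force, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) force[i] = make_float4(0.f, 0.f, 0.f, 0.f);  // placeholder work
    }

    void compute_bonded_forces(const float4 *pos, float4 *f, int n) {
      // CPU evaluation of bond/angle/dihedral/improper terms (omitted).
      (void)pos; (void)f; (void)n;
    }

    void run_timestep(const float4 *h_pos, float4 *h_force, float4 *h_bond_force,
                      float4 *d_pos, float4 *d_force, int n, cudaStream_t stream) {
      size_t bytes = n * sizeof(float4);

      // Stage coordinates on the device without blocking the host thread.
      cudaMemcpyAsync(d_pos, h_pos, bytes, cudaMemcpyHostToDevice, stream);

      // Launch the short-range kernel; control returns to the host immediately.
      int threads = 128, blocks = (n + threads - 1) / threads;
      short_range_kernel<<<blocks, threads, 0, stream>>>(d_pos, d_force, n);

      // Queue the copy-back; it starts once the kernel finishes on this stream.
      cudaMemcpyAsync(h_force, d_force, bytes, cudaMemcpyDeviceToHost, stream);

      // Meanwhile the CPU evaluates the bonded force terms.
      compute_bonded_forces(h_pos, h_bond_force, n);

      // Block only when the GPU results are actually needed.
      cudaStreamSynchronize(stream);
    }

The "Idle" fractions in the figure correspond to the portions of a timestep where one side has finished its share of this overlapped work and is waiting on the other.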
Simulation results for the rhodopsin benchmark on the Keeneland GPU cluster. Simulations are run on 1 to 32 nodes with 3 GPUs per node and 3, 6, or 12 MPI processes per node. Results for the rhodopsin benchmark scaled to 256 000 particles are also shown. (Upper) Wall clock times for the simulation loops. (Lower) Percentage of time required for various routines, as measured on the host, for the GPU runs with 6 processes per node (6PPN). Percentages for concurrent GPU calculations are not shown.
W. Michael Brown, Axel Kohlmeyer, Steven J. Plimpton, Arnold N. Tharrington. Implementing molecular dynamics on hybrid high performance computers – Particle–particle particle-mesh. Computer Physics Communications, Volume 183, Issue 3, March 2012, Pages 449-459. doi:10.1016/j.cpc.2011.10.012