We present a new implementation of the numerical integration of the classical, gravitational N-body problem, based on a high order Hermite integration scheme with block time steps and a direct evaluation of the particle-particle forces. The main innovation of this code (called HiGPUs) is its full parallelization, exploiting both OpenMP and MPI for the multicore Central Processing Units as well as either Compute Unified Device Architecture (CUDA) or OpenCL for the hosted Graphics Processing Units. We tested both the performance and the accuracy of the code using up to 256 GPUs of the IBM iDataPlex DX360M3 Linux Infiniband Cluster provided by the Italian supercomputing consortium CINECA, for values of N up to 8 million. We were able to follow the evolution of a system of 8 million bodies for a few crossing times, a task previously out of reach for direct summation codes. The code is freely available to the scientific community.
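For concreteness, the block time-step idea mentioned above can be sketched as follows (a minimal illustration with hypothetical names, not the actual HiGPUs implementation): each particle's estimated step is rounded down to a power-of-two fraction of a maximum step, so that particles sharing the same step advance synchronously as a block.

    // Minimal sketch of block time-step quantization; dt_est, t and dt_max
    // are hypothetical names, and this is not the HiGPUs source code.
    #include <cmath>

    double block_step(double dt_est, double t, double dt_max)
    {
        double dt = dt_max;
        while (dt > dt_est)              // largest power-of-two step <= estimate
            dt *= 0.5;
        while (std::fmod(t, dt) != 0.0)  // keep the step commensurate with the
            dt *= 0.5;                   // current time, so blocks stay aligned
        return dt;
    }

Quantizing the steps in this way lets the force evaluation act on a whole block of particles at once, which is what makes individual time steps GPU-friendly.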
Composite architectures based on computational nodes hosting multicore Central Processing Units connected to one or more Graphics Processing Units have been shown by various authors to be a clever and efficient solution to supercomputing needs in different scientific frameworks. Indeed, these architectures are characterized by a high ratio of performance to both installation cost and power consumption; a practical proof of this is that some of the most powerful systems in the TOP 500 list of the world's supercomputers are based on this scheme. They are a valid alternative to massively parallel multicore systems, where the final computational power comes from the use of a very large number of CPUs, each with a relatively low clock frequency. It is quite obvious that fully exploiting the performance of CPU+GPU platforms requires codes with a clearly identified, heavy computational kernel, to be assigned in parallel to the GPUs acting as 'number crunchers' which periodically release their results to the hosts. In physics, the study of the evolution of systems of objects interacting via a potential depending on their mutual distance falls into this category.
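As a minimal sketch of this offload pattern (a toy example with hypothetical names, not taken from the HiGPUs sources), in MPI+CUDA each rank drives one GPU, the GPU executes the heavy kernel on its share of the data, and the hosts then merge the partial results:

    // Toy MPI+CUDA offload skeleton: one GPU per MPI rank, periodic exchange.
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <vector>

    __global__ void heavy_kernel(const double* in, double* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i] * in[i];   // stand-in for the real computation
    }

    int main(int argc, char** argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int ngpu = 0;
        cudaGetDeviceCount(&ngpu);
        if (ngpu == 0) MPI_Abort(MPI_COMM_WORLD, 1);  // needs at least one GPU
        cudaSetDevice(rank % ngpu);                   // one GPU per rank

        const int n_local = 1 << 20;                  // this rank's data share
        std::vector<double> h(n_local, rank + 1.0);
        double *d_in, *d_out;
        cudaMalloc(&d_in,  n_local * sizeof(double));
        cudaMalloc(&d_out, n_local * sizeof(double));
        cudaMemcpy(d_in, h.data(), n_local * sizeof(double),
                   cudaMemcpyHostToDevice);

        heavy_kernel<<<(n_local + 255) / 256, 256>>>(d_in, d_out, n_local);
        cudaMemcpy(h.data(), d_out, n_local * sizeof(double),
                   cudaMemcpyDeviceToHost);

        // Hosts periodically merge the partial results; here a plain gather.
        std::vector<double> all((size_t)n_local * size);
        MPI_Allgather(h.data(), n_local, MPI_DOUBLE,
                      all.data(), n_local, MPI_DOUBLE, MPI_COMM_WORLD);

        cudaFree(d_in); cudaFree(d_out);
        MPI_Finalize();
        return 0;
    }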
In this paper we presented and discussed a new, high precision code suited to simulating the time evolution of systems of N point masses interacting through the classical, pairwise Newtonian force. The high precision comes both from the evaluation by direct summation of the pairwise forces among the system bodies and from a proper treatment of the multiple space and time scales of the system, which means resorting to an individual time-stepping procedure with resynchronizations, as well as using a high order (6th) time integrator.
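As an illustration of the computational core, a simplified CUDA kernel for the O(N²) direct summation of accelerations and jerks (the force derivative needed by Hermite-type predictors) may look as follows. The production HiGPUs kernel necessarily differs (shared-memory tiling, precision handling, the higher derivatives required by the 6th order scheme), and the Plummer softening eps2 is an assumption of this sketch.

    // Simplified O(N^2) acceleration + jerk kernel (not the HiGPUs source).
    // pos[k] = (x, y, z, mass); eps2 > 0 softens close encounters and also
    // makes the k == i self term vanish harmlessly.
    __global__ void acc_jerk(int n, double eps2,
                             const double4* pos, const double4* vel,
                             double4* acc, double4* jrk)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        double3 a = {0.0, 0.0, 0.0}, j = {0.0, 0.0, 0.0};
        for (int k = 0; k < n; ++k) {
            double dx = pos[k].x - pos[i].x, dvx = vel[k].x - vel[i].x;
            double dy = pos[k].y - pos[i].y, dvy = vel[k].y - vel[i].y;
            double dz = pos[k].z - pos[i].z, dvz = vel[k].z - vel[i].z;
            double r2   = dx*dx + dy*dy + dz*dz + eps2;
            double rinv = rsqrt(r2);
            double mr3  = pos[k].w * rinv * rinv * rinv;       // m_k / r^3
            double rv   = 3.0 * (dx*dvx + dy*dvy + dz*dvz) / r2;
            a.x += mr3 * dx;              a.y += mr3 * dy;
            a.z += mr3 * dz;
            j.x += mr3 * (dvx - rv * dx); j.y += mr3 * (dvy - rv * dy);
            j.z += mr3 * (dvz - rv * dz);
        }
        acc[i] = make_double4(a.x, a.y, a.z, 0.0);
        jrk[i] = make_double4(j.x, j.y, j.z, 0.0);
    }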
In this paper we discussed the implementation of our fully parallel version of a direct summation algorithm, whose O(N²) computational complexity is dealt with by GPUs acting as computational accelerators in hosting nodes whose multicore CPUs are exploited via OpenMP and linked via MPI directives. The code, called HermiteIntegratorGPUs (HiGPUs, available to the scientific community at astrowww.phys.uniroma1.it/dolcetta/HPCcodes/HiGPUs.html or within the AMUSE project at amusecode.org), shows very good performance in terms of both scaling and efficiency, in a good compromise between precision (as measured by energy and angular momentum conservation) and speed.

We performed an extensive set of test simulations as benchmarks of our code using the PLX composite cluster of CINECA, the Italian inter-university supercomputing consortium. We find that the integration of an N = 8,000,000 body system is carried out with an 80% efficiency, that is, with a deviation of just 20% from linear speedup, when using 256 nVIDIA Tesla M2070 GPUs (see the illustrative note at the end of this section). This corresponds to less than 10 hours of wall clock time to follow the evolution of the 8M body system over one internal crossing time, a performance, to our knowledge, never reached before for this kind of simulation.

This means that with HiGPUs it is possible to follow the evolution of a realistic model of a Globular Cluster (a spherical stellar system, orbiting our own and other galaxies, composed of about 1,000,000 stars packed in a sphere of about 10 pc radius), with a 1:1 correspondence between real stars and simulated particles, over about 10 orbital periods around the galactic center in a few days of computing time. Simulations of this kind will allow, for instance, a thorough investigation of open astrophysical questions that may involve, in their answer, the role of globular clusters and globular cluster systems in galaxies. We cite the open problem of the origin of Nuclear Clusters as observed in various galaxies, like our Milky Way: some authors suggested a dissipational, gaseous origin, while others indicate as more realistic a dissipationless origin through the orbital decay and merging of globular clusters, as already tested numerically.
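As for the efficiency figures quoted above, we recall the usual definition of parallel efficiency; assuming a single-GPU baseline (an assumption of this note, the baseline actually used in the benchmarks may differ), an 80% efficiency on 256 GPUs corresponds to a speedup of about 205:

    \[
      E(P) \;\equiv\; \frac{S(P)}{P} \;=\; \frac{T(1)}{P\,T(P)}, \qquad
      E(256) \simeq 0.8 \;\Longrightarrow\; S(256) \simeq 0.8 \times 256 \simeq 205 .
    \]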
One limit to the use of our code is the GPU memory: with 6 GB of RAM, as in the case of the nVIDIA Tesla M2070, the upper limit on N is ∼ 8,400,000, which is, anyway, a number large enough to guarantee excellent resolution in the simulation of most of the astrophysically interesting cases involving stellar systems (a rough consistency check of this bound is given below). In the light of the results obtained in this paper, we are convinced that it is worth testing some other commercially available high-end GPUs apt to work in a scientific environment, such as the AMD HD6970 and HD7970 series, which we showed (astrowww.phys.uniroma1.it/dolcetta/HPCcodes/HiGPUs.html) to be absolutely competitive in terms of performance/cost ratio. This suggests, for department-sized high-end computing platforms, the clever solution of a few tens of nodes, each composed of hexa-core CPUs connected to 2 to 4 HD7970 GPUs acting as computing accelerators, all at a very reasonable cost in terms of both purchase price and power consumption.
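Coming back to the memory bound mentioned above, a rough, purely illustrative consistency check gives the implied budget per particle as

    \[
      \frac{6\times 10^{9}\ \mathrm{B}}{8.4\times 10^{6}} \;\approx\; 714\ \mathrm{B}
      \;\approx\; 90 \ \text{double-precision words},
    \]

which is plausibly of the order needed to store, in double precision, positions, velocities, accelerations and the higher force derivatives of a 6th order Hermite scheme, together with predictor and communication buffers.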