We describe the use of Graphics Processing Units (GPUs) for speeding up the code NBODY6, which is widely used for direct N-body simulations. Over the years, the O(N^2) nature of the direct force calculation has proved a barrier to extending the particle number. Following the early introduction of force polynomials and individual time-steps, the calculation cost was first reduced by the introduction of a neighbour scheme. After a decade of GRAPE computers, which speeded up the force calculation further, we are now in the era of GPUs, where relatively small hardware systems are highly cost-effective. A significant gain in efficiency is achieved by employing the GPU to obtain the so-called regular force, which typically involves some 99 per cent of the particles, while the remaining local forces are evaluated on the host. However, the latter operation is performed up to 20 times more frequently and may still account for a significant cost. This effort is reduced by parallel SSE/AVX procedures in which each interaction term is calculated mainly in single precision. We also discuss further strategies connected with the coordinate and velocity prediction required by the integration scheme. This leaves hard binaries and multiple close encounters, which are treated by several regularization methods. The present NBODY6–GPU code is well balanced for simulations in the particle range 10^4 − 2 × 10^5 on a dual-GPU system attached to a standard PC.
We have presented new implementations for efficient integration of the N-body problem with GPUs. In the standard NBODY6 code, the regular force calculation dominates the CPU time. Consequently, the emphasis here has been on procedures for speeding up the force calculation. First, the regular force evaluation was implemented on the GPU using the library GPUNB, which also forms the neighbour list. This procedure is ideally suited to massively parallel force calculations on GPUs and resulted in significant gains. However, a subsequent attempt to employ the GPU for the irregular force showed that the overheads are too large. It turned out that different strategies are needed for dealing with the regular and irregular force components, and this eventually led to the development of the special library GPUIRR. The use of SSE and OpenMP speeded up this part such that the respective wall-clock times are comparable for a range of particle numbers.
After recent hardware with AVX support became available, the library was updated. This led to an additional speed-up of the irregular force calculation. It is also essential that the regular force part scales well on multiple GPUs. Thus in the future, further speed-up may be achieved by using four GPUs and an eight-core CPU. It should be noted that in the present scheme, the use of multiple GPUs only benefits the regular force calculation, which therefore scales well. Although large N-body simulations are still quite expensive, we have demonstrated that the regular part of the Ahmad–Cohen neighbour scheme is well suited for use with GPU hardware. Moreover, the current formulation of the NBODY6–GPU code performs well for a variety of difficult conditions.