Monte Carlo simulation is the most accurate method for absorbed dose calculations in radiotherapy, but its efficiency still requires improvement for routine clinical applications, especially for online adaptive radiotherapy. In this paper, we report our recent development of a GPU-based Monte Carlo dose calculation code for coupled electron–photon transport. We have implemented the dose planning method (DPM) Monte Carlo dose calculation package (Sempau et al 2000 Phys. Med. Biol. 45 2263–91) on the GPU architecture under the CUDA platform. The implementation has been tested against the original sequential DPM code on the CPU in phantoms with water–lung–water or water–bone–water slab geometries. A 20 MeV mono-energetic electron point source or a 6 MV photon point source is used in our validation. The results demonstrate adequate accuracy of our GPU implementation for both electron and photon beams in the radiotherapy energy range. Speed-up factors of about 5.0–6.6 have been observed using an NVIDIA Tesla C1060 GPU card against a 2.27 GHz Intel Xeon CPU.
In this paper, we have successfully implemented the DPM Monte Carlo dose calculation package on the GPU architecture under the NVIDIA CUDA platform. We have also tested the efficiency and accuracy of our GPU implementation against the original sequential DPM code on the CPU in various test cases. Our results demonstrate adequate accuracy of the implementation for both electron and photon sources, with observed speed-up factors of about 5.0–6.6. The code is in the public domain and available to readers on request.
MC simulations are known as embarrassingly parallel because particle histories are statistically independent and can therefore be simulated concurrently with little communication. It has been reported that the DPM package has been parallelized on a CPU cluster, with speed increasing roughly linearly in the number of processors for up to 32 nodes of an Intel cluster (Tyagi et al 2004). Nonetheless, this linear scalability is hard to achieve on GPU architectures by simply distributing particles to all threads and treating them as if they were independent computational units.
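The history-partitioning strategy behind this embarrassingly parallel scaling can be sketched as follows. This is an illustrative Python sketch, not part of the DPM code: each worker simulates an independent batch of toy "histories" with its own random stream, and the partial dose tallies are reduced only once, at the end.

```python
import random
from concurrent.futures import ThreadPoolExecutor

NUM_VOXELS = 8              # toy 1D dose grid
HISTORIES_PER_WORKER = 10_000

def simulate_batch(seed):
    """Simulate one batch of independent particle histories.

    Each 'history' deposits a unit of energy in a randomly chosen
    voxel -- a stand-in for a full electron/photon transport step.
    """
    rng = random.Random(seed)       # independent random stream per worker
    local_dose = [0.0] * NUM_VOXELS
    for _ in range(HISTORIES_PER_WORKER):
        local_dose[rng.randrange(NUM_VOXELS)] += 1.0
    return local_dose

def run(num_workers):
    # Workers execute without interfering with each other ...
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(simulate_batch, range(num_workers)))
    # ... and the dose distribution is collected only at the end.
    total = [0.0] * NUM_VOXELS
    for part in partials:
        for i, d in enumerate(part):
            total[i] += d
    return total

dose = run(num_workers=4)
```

In an MPI setting, the final reduction would be a single collective operation (e.g. a sum-reduce of the per-node dose arrays), which is why the approach scales almost linearly with node count.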
In general, the means of performing parallel computation are categorized into task parallelization and data parallelization. MC simulation, a typical task parallelization problem, is well suited to a CPU cluster programmed through, for example, the message passing interface (MPI). All particle histories simulated in an MC dose calculation can be distributed to all processors, which execute simultaneously without interfering with each other. Only at the end of the computation does the dose distribution need to be collected from all processors. Parallelization of this kind can easily speed up the simulation with a large number of CPU nodes. On the other hand, the GPU is known to be suitable for data parallelization problems. A GPU multiprocessor employs an architecture called SIMT (single-instruction, multiple-thread) (NVIDIA 2009). Under this architecture, the multiprocessor executes programs in groups of 32 parallel threads termed warps. If the paths for threads within a warp diverge due to, e.g., some if–else statements, the warp serially executes one branch at a time while putting all other threads in an idle state. Thus, high computational efficiency is achieved only when the 32 threads in a warp proceed along the same execution path. Unfortunately, in an MC calculation, the computational paths on different threads are statistically independent, essentially resulting in serial execution within a warp. Since there are physically only 30 multiprocessors on the Tesla C1060 GPU, our simulation is in effect parallelized by just 30 independent computational units. Considering, furthermore, that the GPU clock speed is 1.3 GHz, about half that of the CPU, the highest possible speed-up factor for a GPU-based MC simulation will be roughly 15 times.
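The back-of-envelope bound above can be made explicit. In this sketch (our own illustration, using the figures quoted in the text), each multiprocessor is treated as one serial unit because fully divergent MC threads within a warp serialize, and the bound is scaled by the clock-speed ratio:

```python
# Rough upper bound on the GPU/CPU speed-up for a fully divergent MC code,
# using the hardware figures quoted in the text (Tesla C1060 vs 2.27 GHz Xeon).
num_multiprocessors = 30   # effective independent units under full warp divergence
gpu_clock_ghz = 1.3
cpu_clock_ghz = 2.27

# Each multiprocessor behaves like one serial core when all 32 threads
# of a warp diverge, so the bound is (units) x (clock-speed ratio).
speedup_bound = num_multiprocessors * (gpu_clock_ghz / cpu_clock_ghz)
print(round(speedup_bound, 1))   # prints 17.2
```

With the exact clock ratio the bound is about 17; taking the ratio as roughly one half, as in the text, gives the quoted figure of roughly 15.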
Other factors may also adversely restrict our simulation efficiency, such as the memory access pattern. Since all threads share a global memory in our code, random access to different memory addresses from different threads produces a serious overhead. In addition, the global memory on the GPU is not cached, leading to 400–600 clock cycles of memory latency, whereas CPU memory is normally cached and favorable for fast fetching. In a nutshell, owing to the many factors limiting simulation efficiency, our code has achieved speed-ups of only about 5.0–6.6 times.
On the other hand, a GPU implementation of an MC simulation still has a clear advantage in cost and accessibility over CPU clusters. Our work has demonstrated the potential of speeding up MC simulations with the help of the GPU. Currently, another MC simulation algorithm, tailored specifically for the GPU architecture, is under development, and a further boost in MC simulation speed is expected.
Xun Jia, Xuejun Gu, Josep Sempau, Dongju Choi, Amitava Majumdar and Steve B Jiang 2010 Development of a GPU-based Monte Carlo dose calculation code for coupled electron–photon transport Phys. Med. Biol. 55(11) doi:10.1088/0031-9155/55/11/006