A Framework for Automated Performance Tuning and Code Verification on GPU Computing Platforms

3 February, 2012

Emerging multi-core processor designs create a computing paradigm capable of advancing numerous scientific areas, including medicine, data mining, biology, physics, and earth sciences. However, multi-core hardware technology has advanced far ahead of software technology and programmer productivity. For the most part, scientists currently leverage multi-core and GPU (Graphics Processing Unit) computing platforms only after painstakingly uncovering the inherent task-level and data-level parallelism in their applications. In many cases, the resulting code does not realize the full potential of the parallel hardware. There is an opportunity to meet the challenges of optimally mapping scientific application domains to multi-core computer systems through compile-time and link-time optimization strategies. We are exploring a code compilation framework that automatically generates and tunes numerical solver codes for optimal performance on graphics processing units. The framework advances computational simulation in kinetic modeling by significantly reducing the execution time of scientific simulations and by enabling scientists to compare results to previous models and to extend, modify, and test new models without code changes.

Conclusions

Details of the speedup we have achieved on multicore processors using Intel’s Threading Building Blocks (TBB) and on a multicore cluster using MPI can be found in [12], in which we examined the process of adapting computational biology models to multicore and distributed-memory architectures. Fig. 2 shows the performance increase we have achieved using the GPU compared with distributed-memory processing using MPI. We have also measured linear GPU speedup of more than 100x over sequential CPU execution; those results are omitted due to space limitations.

Figure 2. This graph compares the chromosomes processed per second by the GPU and by 32-, 64-, 128-, and 256-core CPU configurations using MPI. We did not use more than 256 processes on the cluster because only 204 cores are available, so at 256 processes the nodes are already oversubscribed.
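The two quantities in the Figure 2 caption, throughput and oversubscription, can be written out explicitly. The following is a minimal sketch, not code from the paper: the function names are hypothetical, and the throughput inputs are made-up numbers chosen only to show the arithmetic. Only the 256-process and 204-core figures come from the caption.

```python
def throughput(items_processed: int, elapsed_seconds: float) -> float:
    """Items (here, chromosomes) processed per second of wall-clock time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return items_processed / elapsed_seconds


def oversubscription_ratio(processes: int, cores: int) -> float:
    """Processes per physical core; a value above 1.0 means oversubscribed."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return processes / cores


# Made-up measurement, for illustration only: 1000 items in 4 seconds.
example_rate = throughput(1000, 4.0)

# Figures quoted in the caption: 256 MPI processes on 204 available cores.
ratio = oversubscription_ratio(256, 204)
oversubscribed = ratio > 1.0
```

With 256 processes on 204 cores the ratio is about 1.25, so some cores must run more than one process, which is why larger process counts were not measured.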

The speedups we have achieved to date represent a fundamental change in the science that is possible. We have identified a model that has a better fit to experimental data and that more completely describes all key biological characteristics of AMPA receptors that underlie complex neuronal signaling. Fig. 3 demonstrates that a 5% to 22% performance increase can be achieved on the GPU through better register utilization, without the involvement of the application programmer. Our framework will automatically generate code for optimal block sizes and memory allocation. These architectural details are important to optimal performance but are a barrier to broad use by scientists.

Figure 3. The y-axis shows the speedup of an optimized implementation that uses registers more aggressively, relative to our baseline. Values above one on the y-axis mean the optimized version is faster. Each line is a different GPU block size (a setting that is automated in our platform but that typically must be chosen by the programmer with little insight into how block size affects application behavior). The rising blue line indicates that with larger workloads memory latency is hidden more effectively and that the latency savings are magnified by the increased data parallelism. The other lines indicate a performance sweet spot on the GPU of over 20% that is not realized with standard compilation.
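One common way to automate a setting such as block size is an empirical sweep: time the same workload at each candidate value and keep the fastest. The sketch below illustrates that general idea only; it is not the framework described here, whose tuning method is not detailed in this excerpt. All names (`wall_clock`, `tune_block_size`, `speedup_over`) are hypothetical, and the timing table is synthetic so that the example runs without a GPU.

```python
import time
from typing import Callable, Dict, Iterable, Tuple


def wall_clock(run: Callable[[int], None]) -> Callable[[int], float]:
    """Wrap a launcher so that calling it returns elapsed seconds."""
    def measure(block_size: int) -> float:
        start = time.perf_counter()
        run(block_size)
        return time.perf_counter() - start
    return measure


def tune_block_size(
    measure: Callable[[int], float],
    candidates: Iterable[int],
    repeats: int = 3,
) -> Tuple[int, Dict[int, float]]:
    """Exhaustively time each candidate and return the fastest one."""
    timings: Dict[int, float] = {}
    for block_size in candidates:
        # Keep the minimum over several runs to reduce timing noise.
        timings[block_size] = min(measure(block_size) for _ in range(repeats))
    best = min(timings, key=timings.__getitem__)
    return best, timings


def speedup_over(timings: Dict[int, float], baseline: int) -> Dict[int, float]:
    """Speedup = baseline time / candidate time; above 1.0 means faster."""
    base = timings[baseline]
    return {block_size: base / t for block_size, t in timings.items()}


# Synthetic timings in seconds (NOT measurements from the paper).
SYNTHETIC_SECONDS = {64: 4.0, 128: 2.0, 256: 1.0, 512: 3.0}

best, timings = tune_block_size(SYNTHETIC_SECONDS.__getitem__, [64, 128, 256, 512])
speedups = speedup_over(timings, baseline=128)
```

In a real setting, `measure` would be `wall_clock(launch)` for some function `launch(block_size)` that runs the GPU kernel and waits for it to finish; GPU launches are typically asynchronous, so timing without synchronization would measure only the launch overhead.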

We have advanced computational simulation in kinetic modeling and are well underway with the development of the code generation framework for performance tuning. What remains is to demonstrate that performance is sustained across several models, to prepare an in-depth comparison of GPU to CPU, and to integrate automated code verification into the framework. In future research, we would like to determine how applicable general tools like KPP and NWChem are to the specifics of our chemical kinetics, and perhaps provide auto-tuning for GPU performance within these systems to take advantage of the broader domain support they already have in place.

Allison S. Gehrke, Ilkyeun Ra and Daniel A. Connors. A Framework for Automated Performance Tuning and Code Verification on GPU Computing Platforms. IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum, 2011, pp. 2113–2116. [doi: 10.1109/IPDPS.2011.390]


