Accelerating CUDA graph algorithms at maximum warp
Graphs are powerful data representations favored in many computational domains. Modern GPUs have recently shown promising results in accelerating computationally challenging graph problems but their performance suffered heavily when the graph structure is highly irregular, as most real-world graphs tend to be. In this study, we first observe that the poor performance is caused by work imbalance and is an artifact of a discrepancy between the GPU programming model and the underlying GPU architecture.We then propose a novel virtual warp-centric programming method that exposes the traits of underlying GPU architectures to users. Our method significantly improves the performance of applications with heavily imbalanced workloads, and enables trade-offs between workload imbalance and ALU underutilization for fine-tuning the performance. Our evaluation reveals that our method exhibits up to 9x speedup over previous GPU algorithms and 12x over single thread CPU execution on irregular graphs. When properly configured, it also yields up to 30% improvement over previous GPU algorithms on regular graphs. In addition to performance gains on graph algorithms, our programming method achieves 1.3x to 15.1x speedup on a set of GPU benchmark applications. Our study also confirms that the performance gap between GPUs and other multi-threaded CPU graph implementations is primarily due to the large difference in memory bandwidth.
Conclusion
Parallel execution of graph algorithms on GPUs suffered from workload imbalance for real-world graph instances. To address this issue, we proposed a novel virtual warp-centric programming method, a general strategy that prevents branch divergence and unnecessary scattering memory access. Users can also trade-off between SIMD utilization and branch divergence by varying the virtual warp size.
When applied to graph algorithms, our method resulted in up to 9x speedup over previous GPU implementations. In addition, a set of GPU benchmark applications that suffer from branch diver- gence and scattering memory accesses was accelerated by 1.3x to 15.1x over the baseline GPU implementations. Finally, we showed that the GPU executions of graph algorithms can outperform imple- mentations on other architectures, mainly because of substantially larger memory bandwidth.
Sungpack Hong, Sang Kyun Kim, Tayo Oguntebi, and Kunle Olukotun. 2011. Accelerating CUDA graph algorithms at maximum warp. In Proceedings of the 16th ACM symposium on Principles and practice of parallel programming (PPoPP ’11). ACM, New York, NY, USA, 267-276. [DOI 10.1145/1941553.1941590] [Free PDF]
Category: Computer Science






