NVIDIA just published an interesting blog post about Kepler’s new Hyper-Q feature. The post reports how Hyper-Q helps increase performance for thousands of legacy MPI applications without requiring a major code rewrite.
How Hyper-Q Works
A GPU consists of multiple CUDA cores grouped into streaming multiprocessors operating in parallel. A hardware unit called the CUDA Work Distributor (CWD) is responsible for assigning work to the individual multiprocessors.
In the current Fermi architecture, the CWD has a single connection to the host CPU, and work from different MPI processes is merged into this single queue. This serialization can easily create false dependencies among work from different MPI processes, limiting the amount of work that can be executed concurrently on the GPU. The result is often an underutilized GPU.
Hyper-Q removes this limitation. As shown in the graphic, the new Kepler-based Tesla K20 GPU provides 32 work queues between the host and the GPU, enabling multiple MPI processes to run concurrently on the GPU. Each MPI process can be assigned to a different hardware work queue, maximizing GPU utilization and increasing overall performance.
By enabling more MPI processes on the GPU, Hyper-Q maximizes GPU utilization, increasing overall performance.
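To get a feel for why a single queue hurts, here is a toy model in C. It is not NVIDIA's scheduler or any real API: the function name `modeled_makespan`, the round-robin assignment of processes to queues, and the assumption that separate queues overlap perfectly are all illustrative. A real GPU has finite resources and will not overlap work this cleanly.

```c
#include <stddef.h>

#define MAX_QUEUES 32  /* number of work queues the post cites for the Tesla K20 */

/* Toy model (hypothetical, for illustration only).
 * work[p] is the total GPU time, in arbitrary units, submitted by MPI process p.
 * Processes are assigned to queues round-robin. Work inside one queue runs
 * strictly in order; different queues are assumed to overlap perfectly.
 * Returns the modeled time until all work is done, or -1.0 on a bad queue count. */
double modeled_makespan(const double *work, size_t num_procs, size_t num_queues)
{
    double queue_time[MAX_QUEUES] = {0.0};
    double longest = 0.0;

    if (num_queues == 0 || num_queues > MAX_QUEUES)
        return -1.0;

    /* Accumulate each process's work into its assigned queue. */
    for (size_t p = 0; p < num_procs; ++p)
        queue_time[p % num_queues] += work[p];

    /* The busiest queue determines when everything has finished. */
    for (size_t q = 0; q < num_queues; ++q)
        if (queue_time[q] > longest)
            longest = queue_time[q];

    return longest;
}
```

With four processes submitting 1, 2, 3, and 4 units of work, this model gives 10 units for a single queue (everything serialized) and 4 units when each process gets its own queue.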
Reduced Development Effort for Legacy MPI Codes
While MPI developers will be thrilled with the added performance, they’ll be equally enamored with how Hyper-Q makes porting legacy MPI codes to the GPU significantly easier.
Legacy MPI-based codes were often created to run on multicore CPU systems, with the amount of work assigned to each MPI process scaled accordingly. However, this often meant that a single MPI process didn’t generate enough work to fully occupy the GPU. To make the code launch enough work to fully utilize the GPU, developers were frequently required to modify their codes significantly.
Hyper-Q reduces the recoding effort considerably because developers can now throw many MPI processes with small- and medium-size workloads at a shared GPU. Developers no longer need to modify their codes to put enough work into a single MPI process. Rather, they can send up to 32 MPI processes with varying workloads to the GPU and just let the GPU do all the heavy lifting to maximize performance.
Case In Point: CP2K
CP2K is a widely used atomic and molecular simulation code that runs at many of the world’s supercomputing sites. CP2K is parallelized using MPI and OpenMP, and CUDA is used in some models where GPUs are targeted.
With Fermi-based GPUs, developers actually saw smaller performance gains when MPI processes were limited to small amounts of work, particularly in strong scaling simulations. While the CPU was highly utilized, the GPU sat completely idle for substantial portions of the simulation.
The benchmark below shows the impact of Hyper-Q.
This small data set of 864 water molecules is usually problematic for GPUs. Without Hyper-Q, only one MPI process runs on each node with GPUs, and the performance curve from 1 to 16 nodes is not much better than with CPU-only simulations.
With Hyper-Q, it is now possible to use the same number of MPI processes per node as in the CPU-only case, which means 16 MPI processes per GPU in this instance. This unlocks the full benefit of the GPU, leading to a speedup of 2.5x with Hyper-Q enabled.
And the best part? No extra coding effort is necessary to enable Hyper-Q. All it takes is a Tesla K20 GPU with a CUDA 5 installation and setting an environment variable to let multiple MPI ranks share the GPU – Hyper-Q is then ready to use.
Be Prepared, Start Today
The Tesla K20 will be the first GPU to feature Hyper-Q. It’s scheduled to be available by the end of the year, but you can start preparing today.
Begin by accelerating your code using OpenACC. With OpenACC directives, developers simply insert compiler hints into the code, and the compiler automatically maps compute-intensive portions of the code to the GPU. By using directives within MPI processes, you don’t need to worry about how much work the OpenACC compiler generates per process, because Hyper-Q keeps the GPU as occupied as possible.
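As a rough idea of what a directive looks like, here is a minimal sketch of an OpenACC-annotated loop in C. The `saxpy` function is a hypothetical example, not taken from the NVIDIA post or from CP2K, and the data clauses shown are one reasonable choice rather than the only one.

```c
/* Hypothetical example: y[i] = a * x[i] + y[i] for i in [0, n).
 * With an OpenACC compiler, the directive asks for the loop to be offloaded:
 * x is copied to the device, y is copied to the device and back.
 * A compiler without OpenACC support ignores the pragma and runs the loop
 * on the CPU. */
void saxpy(int n, float a, const float *restrict x, float *restrict y)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

Because the directive is only a hint, the same source still builds and produces the same result on a system without a GPU, which makes it a low-risk way to start preparing legacy MPI codes.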