In a recent blog post, NVIDIA discusses the new Dynamic Parallelism feature of the upcoming Kepler K20 GPU, using Quicksort as an example. Dynamic Parallelism allows the GPU to operate more autonomously from the CPU by generating new work for itself at run time, from inside a kernel. The concept is simple, but the impact is powerful: it can make GPU programming easier, particularly for algorithms traditionally considered difficult for GPUs, such as divide-and-conquer problems.
On GPUs based on the current Fermi architecture, execution follows a fixed, one-way flow from the host CPU to the cores of the GPU. This is illustrated on the left side of the chart below.
(Left): Without Dynamic Parallelism, (Right): With Dynamic Parallelism
With Dynamic Parallelism, the GPU is able to generate new work for itself without involving the CPU at all. This permits dynamic run-time decisions about what to do next, enabling much more complex algorithms than were previously possible (illustrated on the right side of the chart), while simultaneously freeing the CPU to conserve power or perform other work.
Now let’s take a look at the actual CUDA code for Quicksort, with and without Dynamic Parallelism.
Quicksort with Dynamic Parallelism
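The original listing is not reproduced here, so the following is a minimal sketch of what a Dynamic Parallelism Quicksort looks like, not NVIDIA's exact code. The kernel name, the depth cap, and the small-array cutoff are all illustrative choices; the key point is the in-kernel `<<<…>>>` launches. (Compiling kernel-side launches requires a compute capability 3.5+ target and relocatable device code, e.g. `nvcc -arch=sm_35 -rdc=true`.)

```cuda
#define MAX_DEPTH 16            // cap recursion depth (illustrative limit)
#define SMALL_THRESHOLD 32      // below this, sort in place without launching

// Simple in-place sort for small ranges, run by a single thread.
__device__ void selection_sort(int *data, int left, int right)
{
    for (int i = left; i <= right; ++i) {
        int min_idx = i;
        for (int j = i + 1; j <= right; ++j)
            if (data[j] < data[min_idx]) min_idx = j;
        int t = data[i]; data[i] = data[min_idx]; data[min_idx] = t;
    }
}

__global__ void cdp_quicksort(int *data, int left, int right, int depth)
{
    // Fall back to a simple sort when the range is small or too deep.
    if (depth >= MAX_DEPTH || right - left < SMALL_THRESHOLD) {
        selection_sort(data, left, right);
        return;
    }

    // Hoare-style partition around the middle element.
    int pivot = data[(left + right) / 2];
    int i = left, j = right;
    while (i <= j) {
        while (data[i] < pivot) ++i;
        while (data[j] > pivot) --j;
        if (i <= j) {
            int t = data[i]; data[i] = data[j]; data[j] = t;
            ++i; --j;
        }
    }

    // The Dynamic Parallelism step: launch both sub-sorts directly from
    // the GPU as soon as the partition is done -- no CPU round trip.
    if (left < j)
        cdp_quicksort<<<1, 1>>>(data, left, j, depth + 1);
    if (i < right)
        cdp_quicksort<<<1, 1>>>(data, i, right, depth + 1);
}
```

The host's entire job reduces to one launch, e.g. `cdp_quicksort<<<1, 1>>>(d_data, 0, n - 1, 0);` followed by a `cudaDeviceSynchronize()`.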
Quicksort without Dynamic Parallelism
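Again, the original listing is missing, so here is a hedged sketch of the host-driven alternative. The names (`partition_kernel`, `host_quicksort`) are hypothetical, and the CPU-side `std::vector` stands in for the CPU/GPU work stack described below, which in a real implementation is considerably more involved. What matters is the shape: every stage ends with a synchronizing copy back to the host before the next launch can be issued.

```cuda
#include <vector>
#include <utility>
#include <cuda_runtime.h>

// One partition step on the GPU; writes the two sub-range boundaries
// back so the host can decide what to launch next.
__global__ void partition_kernel(int *data, int left, int right, int *out)
{
    int pivot = data[(left + right) / 2];
    int i = left, j = right;
    while (i <= j) {
        while (data[i] < pivot) ++i;
        while (data[j] > pivot) --j;
        if (i <= j) {
            int t = data[i]; data[i] = data[j]; data[j] = t;
            ++i; --j;
        }
    }
    out[0] = j;   // right end of the left sub-range
    out[1] = i;   // left end of the right sub-range
}

void host_quicksort(int *d_data, int n)
{
    std::vector<std::pair<int, int>> stack;   // CPU-side work stack
    stack.push_back({0, n - 1});

    int *d_out, h_out[2];
    cudaMalloc(&d_out, 2 * sizeof(int));

    while (!stack.empty()) {
        int left  = stack.back().first;
        int right = stack.back().second;
        stack.pop_back();
        if (left >= right) continue;

        partition_kernel<<<1, 1>>>(d_data, left, right, d_out);

        // The round trip Dynamic Parallelism eliminates: the host must
        // synchronize and read back the boundaries after every stage.
        cudaMemcpy(h_out, d_out, 2 * sizeof(int), cudaMemcpyDeviceToHost);

        if (left < h_out[0])  stack.push_back({left, h_out[0]});
        if (h_out[1] < right) stack.push_back({h_out[1], right});
    }
    cudaFree(d_out);
}
```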
Even if you aren’t a programmer you’ll notice that Quicksort with Dynamic Parallelism is half the size of the code without it. And it’s much easier to follow. Here’s why.
In Quicksort, the information needed to sort each stage depends on the results of the stage before it. Without Dynamic Parallelism, all of the launches must take place from the CPU, which means that the details of what to launch next must be passed back to the host after each stage. For simplicity, the example encapsulates this communication in a CPU/GPU work stack; this can be highly complex in its own right, requiring atomics, data management, and as much code as the Quicksort algorithm itself.
But with Dynamic Parallelism, the GPU performs its own launches on the fly, enabling each Quicksort stage to launch its two sub-sorts as soon as it has finished. There is no complex overhead like the CPU/GPU stack exchange, and no need for all the host code that manages the launches.
Category: Code Examples