We demonstrate that radically differing implementations of finite element methods (FEMs) are needed on multi-core (CPU) and many-core (GPU) architectures, if their respective performance potential is to be realised. Our numerical investigations using a finite element advection–diffusion solver show that increased performance on each architecture can only be achieved by committing to specific and diverse algorithmic choices that cut across the high-level structure of the implementation. Making these commitments to achieve high performance for a single architecture leads to a loss of performance portability. Data structures that include redundant data but enable coalesced memory accesses are faster on many-core architectures, whereas redundancy-free data structures that are accessed indirectly are faster on multi-core architectures. The Addto algorithm for global assembly is optimal on multi-core architectures, whereas the Local Matrix Approach is optimal on many-core architectures despite requiring more computation than the Addto algorithm. These results demonstrate the value in making the correct choice of algorithm and data structure when implementing FEMs, spectral element methods and low-order discontinuous Galerkin methods on modern high-performance architectures.
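The data-layout trade-off described above can be illustrated with a small sketch. This is a hypothetical toy mesh, not taken from the paper's implementation: the indirect layout stores each node's coordinates once and gathers them through the element-to-node map, while the redundant layout duplicates coordinates per element so that GPU threads can read them with direct, coalesced accesses.

```python
import numpy as np

# Hypothetical toy mesh: 4 nodes, 2 triangles sharing an edge.
node_coords = np.array([[0.0, 0.0],
                        [1.0, 0.0],
                        [0.0, 1.0],
                        [1.0, 1.0]])
elements = np.array([[0, 1, 2],    # element -> global node indices
                     [1, 3, 2]])

# Redundancy-free layout (favoured on multi-core CPUs):
# coordinates are gathered through the indirection of `elements`.
coords_indirect = node_coords[elements]            # shape (2, 3, 2)

# Redundant layout (favoured on many-core GPUs): each element keeps
# its own copy of its nodes' coordinates, trading extra storage for
# direct, coalesced reads with no indirection at access time.
coords_redundant = np.stack([node_coords[e] for e in elements])

# Both layouts describe the same geometry.
assert np.array_equal(coords_indirect, coords_redundant)
```

The redundant copy is built once (e.g. at mesh-transfer time) so that the per-element kernels never chase the indirection.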
We have investigated the implementation space for parallel global assembly in the finite element method on a variety of GPU platforms and on a more traditional CPU architecture, using an implementation of a finite element advection–diffusion solver.
Our results demonstrate that the choice of algorithm in finite element method implementations has a substantial impact on performance, and that the best choice depends on the target architecture. In particular, the Addto algorithm is preferable in CPU implementations, whereas the Local Matrix Approach is more efficient in GPU implementations. The choice of data structures is likewise influenced by the target platform: data formats that contain redundant data are more efficient on many-core architectures, whereas structures that use indirection and avoid storing redundant data are more profitable on multi-core architectures.
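The distinction between the two assembly strategies can be sketched as follows. This is a minimal illustration on a hypothetical toy mesh with made-up local matrices, not the paper's implementation: the Addto algorithm scatters each element matrix into one global matrix, while the Local Matrix Approach never forms the global matrix and instead evaluates each matrix-vector product as a gather, per-element dense multiply, and scatter-add, accepting redundant arithmetic in exchange for regular, dense work.

```python
import numpy as np

# Toy mesh: 4 global nodes, 2 triangular elements sharing an edge.
elements = np.array([[0, 1, 2], [1, 3, 2]])
n_nodes = 4
# Hypothetical 3x3 element matrices standing in for, e.g., a stiffness form.
local_mats = [np.arange(9.0).reshape(3, 3) + i for i in range(2)]

# Addto: scatter-add each element matrix into a single global matrix
# (dense here for clarity; sparse in practice).
A = np.zeros((n_nodes, n_nodes))
for e, K in zip(elements, local_mats):
    A[np.ix_(e, e)] += K

# Local Matrix Approach: keep the element matrices and evaluate
# y = A @ x element by element: gather x, dense multiply, scatter-add.
def lma_matvec(x):
    y = np.zeros_like(x)
    for e, K in zip(elements, local_mats):
        y[e] += K @ x[e]    # redundant flops, but regular dense work
    return y

# Both strategies compute the same operator action.
x = np.arange(n_nodes, dtype=float)
assert np.allclose(A @ x, lma_matvec(x))
```

On a GPU the per-element multiplies map naturally onto threads with coalesced access to the stored element matrices, which is why the extra computation of the Local Matrix Approach still wins there.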
We have demonstrated that although the OpenCL implementation is portable across CPU and GPU platforms, optimal performance cannot be achieved on all architectures without modifying the source code. This motivates automating code generation, so that the various dimensions of the implementation space can be navigated to pick the best implementation strategy for each context, and it underpins our present work on a code-generation tool that transforms UFL sources into CUDA implementations. The code generator will choose the data structure and assembly algorithm according to the context of the code generation. To select the optimal implementation, we believe it will be necessary to develop a performance model that helps determine which algorithms will yield the greatest performance on the target hardware.
G.R. Markall, A. Slemmer, D.A. Ham, P.H.J. Kelly, C.D. Cantwell and S.J. Sherwin. Finite element assembly strategies on multi-core and many-core architectures. International Journal for Numerical Methods in Fluids, 2012. doi: 10.1002/fld.3648.