ispc is an R&D compiler for a C-based language, aimed at exploring the performance available from SPMD (single program, multiple data) computation on the SIMD units of CPUs and of Intel Xeon Phi coprocessors (which use the Intel Many Integrated Core (MIC) architecture). It has delivered performance competitive with hand-coded SSE and AVX for a variety of graphics and throughput kernels, and typically delivers a 3x-4x speedup over scalar C/C++ code on SSE and a 5x-7x speedup on AVX (for computations that are amenable to SPMD implementation), while still providing the ease of use of a C-like language.
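To make the SPMD-on-SIMD idea a little more concrete, here is a rough sketch in plain C++ (not ispc syntax, and not a description of how the compiler works internally). The program is written once for a single element, and a "gang" of instances runs it together, one instance per lane. The names program and run_spmd and the default gang width of 4 are made up for this illustration; the inner loop over lanes only stands in for what a real SIMD unit would do in lockstep.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// The "program": written once, for one element. In the SPMD model each
// SIMD lane runs its own instance of it. (Hypothetical example kernel.)
inline float program(float x) {
    // Divergent control flow: on SIMD hardware both sides would run under a
    // per-lane mask; in this scalar emulation each lane just takes its branch.
    if (x < 0.0f) {
        return -x;
    }
    return x * x;
}

// Emulates a gang of gang_width program instances walking over the input.
// gang_width = 4 is an assumption matching 32-bit floats in a 128-bit register.
std::vector<float> run_spmd(const std::vector<float>& in,
                            std::size_t gang_width = 4) {
    assert(gang_width > 0);
    std::vector<float> out(in.size());
    for (std::size_t base = 0; base < in.size(); base += gang_width) {
        // Stand-in for one SIMD step: every lane executes the same program.
        for (std::size_t lane = 0; lane < gang_width; ++lane) {
            const std::size_t i = base + lane;
            const bool active = i < in.size();  // mask off lanes past the end
            if (active) {
                out[i] = program(in[i]);
            }
        }
    }
    return out;
}
```

Calling run_spmd on a five-element input with the default width processes one full gang and then a partial one in which only the first lane is active.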
The paper “ispc: A SPMD Compiler for High-Performance CPU Programming” by Matt Pharr and William R. Mark won the best paper award at InPar 2012. Its abstract:
SIMD parallelism has become an increasingly important mechanism for delivering performance in modern CPUs, due to its power efficiency and relatively low cost in die area compared to other forms of parallelism. Unfortunately, languages and compilers for CPUs have not kept up with the hardware’s capabilities. Existing CPU parallel programming models focus primarily on multi-core parallelism, neglecting the substantial computational capabilities that are available in CPU SIMD vector units. GPU-oriented languages like OpenCL support SIMD but lack capabilities needed to achieve maximum efficiency on CPUs and suffer from GPU-driven constraints that impair ease of use on CPUs.
We have developed a compiler, the Intel SPMD Program Compiler (ispc), that delivers very high performance on CPUs thanks to effective use of both multiple processor cores and SIMD vector units. ispc draws from GPU programming languages, which have shown that for many applications the easiest way to program SIMD units is to use a single-program, multiple-data (SPMD) model, with each instance of the program mapped to one SIMD lane. We discuss language features that make ispc easy to adopt and use productively with existing software systems and show that ispc delivers up to 35x speedups on a 4-core system and up to 240x speedups on a 40-core system for complex workloads (compared to serial C++ code).
It is an excellent paper that articulates the challenges of vectorization and explains the important context very well. It also gives a solid demonstration of what is possible when you think clearly about SPMD-on-SIMD models. Of particular note is the approach of quantifying the effects of turning off individual proposed extensions (uniform, coherent control flow). Fundamental concepts like AOS vs. SOA are well explained (with drawings). I think everyone should understand how wider vectors (CPU and GPU) interact with programming languages and will shape programming for the next several decades… and everyone should read this paper in its entirety at least once.
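For readers who have not met the AOS vs. SOA distinction before, a minimal C++ sketch of the two layouts follows. It is not taken from the paper; the type and function names (PointAoS, PointsSoA, to_soa, sum_x_aos, sum_x_soa) are invented for this example. The point is only the memory layout: with an array of structures, consecutive x values sit three floats apart, while with a structure of arrays they are contiguous, which is the access pattern vector loads prefer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array of structures (AoS): x, y, z of one point are adjacent in memory.
struct PointAoS {
    float x, y, z;
};

// Structure of arrays (SoA): all x values are contiguous, then all y, then z.
struct PointsSoA {
    std::vector<float> x, y, z;
};

// Converts an AoS point list into the SoA layout.
PointsSoA to_soa(const std::vector<PointAoS>& pts) {
    PointsSoA out;
    out.x.reserve(pts.size());
    out.y.reserve(pts.size());
    out.z.reserve(pts.size());
    for (const PointAoS& p : pts) {
        out.x.push_back(p.x);
        out.y.push_back(p.y);
        out.z.push_back(p.z);
    }
    return out;
}

// AoS traversal: each x is sizeof(PointAoS) bytes from the next one, so
// filling a vector register with x values needs a strided access (a gather).
float sum_x_aos(const std::vector<PointAoS>& pts) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < pts.size(); ++i) {
        sum += pts[i].x;
    }
    return sum;
}

// SoA traversal: x values are unit-stride, so neighbouring lanes can be
// filled with one contiguous load.
float sum_x_soa(const PointsSoA& pts) {
    assert(pts.x.size() == pts.y.size() && pts.y.size() == pts.z.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < pts.x.size(); ++i) {
        sum += pts.x[i];
    }
    return sum;
}
```

Both sum functions compute the same result; what differs is how the data they touch is laid out, which is the property the paper's drawings illustrate.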
ispc is available as an open source project from http://ispc.github.com.
Compiling For The Intel Xeon Phi Architecture
ispc has beta-level support for compiling for the many-core Intel® Xeon Phi architecture (formerly “Many Integrated Core” / MIC). This support is based on the “generic” C++ output, described in the previous section.
To compile for Xeon Phi, first generate intermediate C++ code:
ispc foo.ispc --emit-c++ --target=generic-16 -o foo.cpp --c++-include-file=knc.h
The ispc distribution now includes a header file, examples/intrinsics/knc.h, which maps from the generic C++ output to the corresponding intrinsic operations for Intel Xeon Phi. Thus, to generate an object file, use the Intel C Compiler (icc) to compile the C++ code generated by ispc, setting the #include search path so that it can find the examples/intrinsics/knc.h header file in the ispc distribution.
With the current beta implementation, complex ispc programs are able to run on Xeon Phi, though there are a number of known limitations:
- The examples/intrinsics/knc.h header file isn’t complete yet; for example, vector operations with int8 and int16 types aren’t yet implemented. Programs that operate on varying int32, float, and double data types (and uniform variables of any data type, and arrays and structures of these types) should operate correctly.
- If you use the launch functionality to launch tasks across cores, note that the pthreads task system implemented in examples/tasksys.cpp hasn’t been tuned for Xeon Phi yet, and has known issues with setting thread affinities optimally.
- The compiler currently emits unaligned memory accesses in many cases where the memory address is actually aligned. This may unnecessarily impact performance.
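On the alignment point, one way to check on the application side whether a buffer handed to ispc-generated code is aligned to the vector width is sketched below in plain C++. This is only an illustration under stated assumptions: 64 bytes corresponds to one 512-bit vector (16 lanes of 32-bit values, as in the generic-16 target), and the names kVectorBytes, is_aligned, and AlignedBlock are invented here and are not part of ispc or knc.h. It does not change which instructions the compiler emits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed vector width in bytes: 512 bits = 16 lanes x 4 bytes.
constexpr std::size_t kVectorBytes = 64;

// Returns true if p is a multiple of alignment, which must be a power of two.
inline bool is_aligned(const void* p, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// A block of 16 floats whose storage is forced onto a 64-byte boundary.
// (Hypothetical helper type for this example.)
struct alignas(kVectorBytes) AlignedBlock {
    float data[16];
};

static_assert(sizeof(AlignedBlock) == kVectorBytes,
              "one block should be exactly one vector wide");
```

A check such as is_aligned(block.data, kVectorBytes) can be dropped into debug builds to confirm that the addresses really are aligned, which is the case the limitation above is about.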
All of these issues are being actively addressed and will be fixed in future releases.
If you do use the current version of ispc on Xeon Phi, please let us know of any bugs or unexpected results (and any interesting results, too!). Note that access to Xeon Phi and public discussion of Xeon Phi performance are still governed by NDA, so please send email to “matt dot pharr at intel dot com” for any issues that shouldn’t be filed in the public ispc bug tracker.
[via Intel blog]