GTC2013

VexCL: Vector expression template library for OpenCL

| 6 July, 2012

Fork me on GitHub

VexCL is vector expression template library for OpenCL. It has been created for ease of C++ based OpenCL development. Multi-device (and multi-platform) computations are supported. Source code of the library is publicly available under MIT license. See the Doxygen-generated documentation at http://ddemidov.github.com/vexcl.

It is known that one of weakest side of OpenCL is lack of tools and libraries around it and the amount of boilerplate code needed to develop OpenCL applications. The VexCL library tries to solve the latter issue. Denis Demidov, VexCL principal developer also wrote introductory article describing main features of VexCL interface. The first part is an introduction to the VexCL interface. The second part compares VexCL performance with existing GPGPU implementations. The comparison is based on odeint - a modern C++ library for numerical solutions of ordinary differential equations.

With his permission we repost part of his article below:

Selection of compute devices

You can select any number of available compute devices, which satisfy provided filters. Filter is a functor returning bool and acting on a cl::Device parameter. Several standard filters are provided, such as device type or name filter, double precision support etc. Filters can be combined with logical operators. In the example below all devices with names matching “Radeon” and supporting double precision are selected:

#include <vexcl/vexcl.hpp>
using namespace vex;
int main() {
    vex::Context ctx(Filter::Name("Radeon") && Filter::DoublePrecision);
    std::cout << ctx << std::endl;
}

vex::Context object holds list of initialized OpenCL contexts and command queues for each filtered device. If you just need list of available devices without creating contexts and queues on them, then look for device_list() function in documenation.

Memory allocation and vector arithmetic

Once you got queue list, you can allocate OpenCL buffers on the associated devices. vex::vector constructor accepts std::vector of cl::CommandQueue. The contents of the created vector will be partitioned between each queue (presumably, each of the provided queues is linked with separate device). Size of each partition will be proportional to relative device bandwidth unless macro VEXCL_DUMB_PARTITIONING is defined, in which case equal partitioning scheme will be applied. Device bandwidth is measured first time it is requested by launch of small test kernel.

Multi-platform computation is supported (that is, you can spread your vectors across devices by different vendors), but should be used with caution: all computations will be performed with the speed of the slowest device selected.

In the example below host vector is allocated and initialized, then copied to all GPU devices found in the system. A couple of empty device vectors are allocated as well:

const size_t n = 1 << 20;
std::vector x(n);
std::generate(x.begin(), x.end(), [](){ return (double)rand() / RAND_MAX; });
vex::Context ctx(Filter::Type(CL_DEVICE_TYPE_GPU));
vex::vector X(ctx.queue(), x);
vex::vector Y(ctx.queue(), n);
vex::vector Z(ctx.queue(), n);

You can now use simple vector arithmetic with device vector. For every expression you use, appropriate kernel is compiled (first time it is encountered in your program) and called automagically. If you want to see sources of the generated kernels on the standard output, define VEXCL_SHOW_KERNELS macro before including VexCL headers.

Vectors are processed in parallel across all devices they were allocated on:

Y = 42;
Z = sqrt(2 * X) + cos(Y);

You can copy the result back to host or you can use vector::operator[] to read (or write) vector elements directly. Though latter technique is very ineffective and should be used for debugging purposes only.

copy(Z, x);
assert(x[42] == Z[42]);

Another frequently performed operation is reduction of a vector expression to single value, such as summation. This can be done with vex::Reductor class:

Reductor sum(ctx.queue());
std::cout << sum(Z) << std::endl;
std::cout << sum(sqrt(2 * X) + cos(Y)) << std::endl;

Sparse matrix-vector multiplication

One of the most common operations in linear algebra is matrix-vector multiplication. Class vex::SpMat holds representation of a sparse matrix, spanning several devices. In the example below it is used for solution of a system of linear equations with conjugate gradients method:

typedef double real;
// Solve system of linear equations A u = f with conjugate gradients method.
// Input matrix is represented in CSR format (parameters row, col, and val).
void cg_gpu(
        const std::vector &row, // Indices to col and val vectors.
        const std::vector &col, // Column numbers of non-zero elements.
        const std::vector   &val, // Values of non-zero elements.
        const std::vector   &rhs, // Right-hand side.
        std::vector &x            // In: initial approximation; out: result.
        )
{
    // Init OpenCL.
    vex::Context ctx(Filter::Type(CL_DEVICE_TYPE_GPU));
    // Move data to compute devices.
    size_t n = x.size();
    vex::SpMat  A(ctx.queue(), n, row.data(), col.data(), val.data());
    vex::vector f(ctx.queue(), rhs);
    vex::vector u(ctx.queue(), x);
    vex::vector r(ctx.queue(), n);
    vex::vector p(ctx.queue(), n);
    vex::vector q(ctx.queue(), n);
    Reductor max(ctx.queue());
    Reductor sum(ctx.queue());
    // Solve equation Au = f with conjugate gradients method.
    real rho1, rho2;
    r = f - A * u;
    for(uint iter = 0; max(fabs(r)) > 1e-8 && iter < n; iter++) {
        rho1 = sum(r * r);
        if (iter == 0) {
            p = r;
        } else {
            real beta = rho1 / rho2;
            p = r + beta * p;
        }
        q = A * p;
        real alpha = rho1 / sum(p * q);
        u += alpha * p;
        r -= alpha * q;
        rho2 = rho1;
    }
    // Get result to host.
    copy(u, x);
}

User-defined functions

Simple arithmetic expressions are sometimes not enough. Imagine that you need to count how many elements in vector x are greater that their counterparts in vector y. This may be achieved by introduction of custom function. In order to build such a function, you need to supply its body, its return type and types of its arguments. After that, you can apply the function to any valid vector expressions:

// Function body has to be defined at global scope, and it has to be of `extern
// const char[]` type. This allows us to use it as a template parameter.
extern const char greater_body[] = "return prm1 > prm2 ? 1 : 0;";
UserFunction greater;
size_t count_if_greater(
    const Reductor &sum,
    const vex::vector &x,
    const vex::vector &y
    )
{
    return sum(greater(x, y));
}

You could also write sum(greater(x + y, 5 * y)), or use any other expressions as parameters to the greater() call. Note that in the function body parameters are always named as prm1, prm2, etc.

Using custom kernels

Custom kernels are of course possible as well. vector::operator(uint) returns cl::Buffer object for a specified device:

vex::Context ctx(Filter::Type(CL_DEVICE_TYPE_GPU));
const size_t n = 1 << 20;
vex::vector x(ctx.queue(), n);
auto program = build_sources(context, std::string(
    "kernel void dummy(ulong size, global float *x)n"
    "{n"
    "    size_t i = get_global_id(0);n"
    "    if (i < size) x[i] = 4.2;n"
    "}n"
    ));
for(uint d = 0; d < ctx.size(); d++) {
    auto dummy = cl::Kernel(program, "dummy").bind(ctx.queue()[d], alignup(n, 256), 256);
    dummy((cl_ulong)x.part_size(d), x(d));
}
Reductor sum(ctx.queue());
std::cout << sum(x) << std::endl;

Scalability

In the images below, scalability of the library with respect to number of compute devices is shown. Effective performance (GFLOPS) and bandwidth (GB/sec) were measured by launching big number of test kernels on one, two, or three Nvidia Tesla C2070 cards. Effect of adding fourth, slower, device (Intel Core i7) were tested as well. The results shown are averaged over 20 runs.

The details of the experiments may be found in examples/benchmark.cpp file. Basically, performance of the following code was measured:

// Vector arithmetic
a += b + c * d;

// Reduction
double s = sum(a * b);

// Stencil convolution
y = x * s;

// SpMV
y += A * x;
perf.png

As you can see, performance and bandwidth for stencil convolution operation are much higher than for other primitives. This is due to the fact that much faster local (shared) memory is used in this algorithm, and formulas for effective performance and bandwidth do not take this into account.

Another thing worth noting is overall degradation of performance after Intel CPU is added to VexCL context. The only primitive gaining speed from this addition is vector arithmetic. This is probably because performance of vector arithmetic was used as a basis for problem partitioning.

Supported compilers

VexCL makes heavy use of C++11 features, so your compiler has to be modern enough. GCC version 4.6 and above is fully supported. Microsoft Visual C++ 2010 manages to compile the project with some features disabled: since it does not support variadic templates, only one-argument builtin functions are enabled; user functions are not available at all.

Read full article at codeproject.com

Download source code – 193 KB

[Submitted by Denis Demidov,  Kazan Federal University, Russia]

Tags: , , , , , ,

Category: Code Examples, Computer Science

Comments are closed.