
NVIDIA Tesla K20 Benchmark with Financial Applications

27 November 2012


To the best of our knowledge, Xcelerit has just posted the first ever benchmark of the Tesla K20 (Kepler architecture) on real-world financial applications. This post compares the performance achieved with the Xcelerit platform on the Tesla M2050 (Fermi architecture) with that on the Tesla K20 (Kepler architecture). As an example application, we use the Monte-Carlo LIBOR swaption portfolio pricer, a real-world computational finance algorithm.

Monte-Carlo LIBOR Swaption Portfolio Pricing

A Monte-Carlo simulation is used to price a portfolio of LIBOR swaptions. Thousands of possible future development paths for the LIBOR interest rate are simulated using normally-distributed random numbers. For each of these Monte-Carlo paths, the value of the swaption portfolio is computed by applying a portfolio payoff function. The equations for computing the LIBOR rates and payoff are given here. Furthermore, the sensitivity of the portfolio value with respect to changes in the underlying interest rate is computed using an adjoint method. This sensitivity is a Greek, called λ, as detailed here. Both the final portfolio value and the λ value are obtained by computing the mean of all per-path values.
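The general shape of such a simulation can be sketched in a few lines of Python. This is only a toy illustration under loudly stated assumptions, not the pricer from the post: it evolves a single forward rate with driftless lognormal dynamics instead of the full LIBOR market model, uses a simple call-style payoff in place of the swaption portfolio payoff, and estimates the sensitivity with a plain pathwise derivative rather than the adjoint method. All names and parameter values (`simulate_path`, `payoff`, `monte_carlo_price`, `f0`, `sigma`, and so on) are hypothetical.

```python
import math
import random


def simulate_path(rng, f0, sigma, dt, n_steps):
    """Evolve one rate over n_steps using normally distributed shocks (toy model)."""
    f = f0
    for _ in range(n_steps):
        z = rng.gauss(0.0, 1.0)  # standard normal random number
        f *= math.exp(-0.5 * sigma * sigma * dt + sigma * math.sqrt(dt) * z)
    return f


def payoff(f, strike):
    """Hypothetical stand-in for the portfolio payoff function."""
    return max(f - strike, 0.0)


def monte_carlo_price(n_paths, f0=0.05, strike=0.05, sigma=0.2,
                      dt=0.25, n_steps=20, seed=42):
    """Return (mean payoff, mean pathwise sensitivity to f0) over all paths."""
    rng = random.Random(seed)
    value_sum = 0.0
    sens_sum = 0.0
    for _ in range(n_paths):
        f = simulate_path(rng, f0, sigma, dt, n_steps)
        value_sum += payoff(f, strike)
        # Pathwise derivative: the final rate scales linearly with f0,
        # so d(payoff)/d(f0) = 1{f > strike} * f / f0.
        if f > strike:
            sens_sum += f / f0
    # Both results are means of the per-path values.
    return value_sum / n_paths, sens_sum / n_paths


if __name__ == "__main__":
    value, sensitivity = monte_carlo_price(10_000)
    print(f"value = {value:.6f}, sensitivity = {sensitivity:.4f}")
```

Each path is independent of the others, which is what makes this kind of algorithm a natural fit for GPUs; only the final averaging step combines results across paths.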

The figure below illustrates the algorithm:

More details on the algorithm can be found in the following publication:

“Many-core Accelerated LIBOR Swaption Portfolio Pricing”, published at the 5th IEEE Workshop in High-Performance Computing for Computational Finance (WHPCF) at SC12, November 2012, Salt Lake City, Utah.

This paper describes the acceleration of a Monte-Carlo algorithm for pricing a LIBOR swaption portfolio on multi-core CPUs and GPUs using the Xcelerit platform. Speedups of up to 305x are achieved on two NVIDIA Tesla M2050 GPUs and up to 20.8x on two Intel Xeon E5620 CPUs, compared to a sequential CPU implementation. This performance is achieved by using the Xcelerit platform: writing sequential, high-level C++ code and adopting a simple dataflow programming model. This avoids the complexity involved in using low-level high-performance computing frameworks such as OpenMP, OpenCL, CUDA, or SIMD intrinsics. The paper provides an overview of the Xcelerit platform, details how high performance is achieved through various automatic optimisation and parallelisation techniques, and shows how the tool can be used to implement portable accelerated Monte-Carlo algorithms in finance. It illustrates the implementation of the Monte-Carlo LIBOR swaption portfolio pricer and gives performance results. A comparison of the Xcelerit platform implementation with an equivalent low-level CUDA version shows that the overhead introduced is less than 1.5% in all scenarios.

The PDF can be downloaded from http://www.xcelerit.com/whpcf12/ (registration is required).

Benchmark Setup

We compare the same application, implemented using the Xcelerit SDK 2.0.3, on two different systems. Their configuration is given in the following table:

| Component | Fermi System | Kepler System |
|---|---|---|
| CPU | 2 × Intel Xeon E5620 | 2 × Intel Xeon E5-2670 |
| GPU | 2 × NVIDIA Tesla M2050 | 2 × NVIDIA Tesla K20Xm |
| OS | RHEL 5.4 (64-bit) | RHEL 6.2 (64-bit) |
| RAM | 24 GB | 64 GB |
| GPU driver | 304.22 | 304.47.06 |
| CUDA Toolkit | 4.2 | 4.2 |
| Host compiler | GCC 4.4 | GCC 4.4 |

Note that we are only comparing GPU performance, so the difference between the CPUs in the two systems has no significant effect on the outcome.

Performance

We measured the computation times for the Monte-Carlo LIBOR swaption portfolio pricer on one GPU of each system, pricing a portfolio of 15 swaptions over 80 time steps and using varying numbers of Monte-Carlo paths. The run time of the full algorithm – including random number generation, data transfers, core computation, and reduction – is visualized for single and double precision in the graph below.

As we can see, there is a significant speedup when using the new K20Xm GPU (up to 1.9x). The table below shows the speedup factors of the Kepler vs. the Fermi GPU for different numbers of paths for better comparison:

| Paths | Speedup (single precision) | Speedup (double precision) |
|---|---|---|
| 16K | 1.34x | 1.61x |
| 64K | 1.51x | 1.76x |
| 256K | 1.78x | 1.86x |
| 1024K | 1.86x | 1.89x |

It is apparent that NVIDIA’s new Tesla K20Xm GPU gives a substantial performance improvement for real-world applications: up to 1.9x in this example, very close to the theoretical peak of 2x. The improvement is also larger for double precision than for single precision, something financial institutions will be pleased to hear.
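To make the comparison with the theoretical peak concrete, the short Python sketch below takes the speedup factors from the table above and expresses each as a fraction of the 2x peak quoted in the text. The helper names (`speedup`, `fraction_of_peak`) are illustrative only, and the definition of speedup as reference run time divided by new run time is the usual convention, assumed here rather than stated in the post.

```python
# Speedup factors copied from the table above: paths -> (single, double).
SPEEDUPS = {
    "16K": (1.34, 1.61),
    "64K": (1.51, 1.76),
    "256K": (1.78, 1.86),
    "1024K": (1.86, 1.89),
}

THEORETICAL_PEAK = 2.0  # peak speedup quoted in the text


def speedup(t_reference, t_new):
    """Speedup factor: reference run time divided by new run time (assumed convention)."""
    return t_reference / t_new


def fraction_of_peak(factor, peak=THEORETICAL_PEAK):
    """Share of the theoretical peak that a measured speedup reaches."""
    return factor / peak


for paths, (single, double) in SPEEDUPS.items():
    print(f"{paths:>6}: single {fraction_of_peak(single):.1%}, "
          f"double {fraction_of_peak(double):.1%} of peak")
```

Read this way, the gap between single and double precision is widest for small path counts and narrows as the number of paths grows.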

[via Xcelerit blog, submitted by Hicham Lahlou]
