PostgreSQL + GPU = PGStrom
PG-Strom is a module of FDW (foreign data wrapper) of PostgreSQL database. It was designed to utilize GPU devices to accelarate sequential scan on massive amount of records with complex qualifiers. Its basic concept is CPU and GPU should focus on the workload with their advantage, and perform concurrently. CPU has much more flexibility, thus, it has advantage on complex stuff such as Disk-I/O, on the other hand, GPU has much more parallelism of numerical calculation, thus, it has advantage on massive but simple stuff such as check of qualifiers for each rows.
The below figure is a basic concept of PG-Strom. Now, on sequential scan workload, vanilla PostgreSQL does iteration of fetch a tuple and checks of qualifiers for each tuples. If we could consign GPU the workload of green portion, it enables to reduce workloads of CPU, thus, it shall be able to load more tuples in advance. Eventually, it should allow to provide shorter response-time on complex queries towards large amount of data. In measurement, it made possible to run a query with x10~x20 times shorter response-time that regular sequential-scan case, even though it depends on its workload, of course.
Installation
As a prerequisite, GPU device must support CUDA, thus, only NVidia’s products are available now. Some of cloud providers supports CUDA supported GPU devices, so, it may be an option.
- Install CUDA Toolkit v4.0 (or later)
- Check out the latest PostgreSQL tree.
- You can obtain the latest PostgreSQL tree at git.postgresql.org. Even though SQL/MED has been supported since v9.1, PG-Strom uses some of internal APIs that was modified at v9.2devel. Thus, it is not compilable towards v9.1.
- Build the PostgreSQL as usual. The pg_config should be installed on a directory listed on $PATH.
- Check out the latest PG-Strom tree.
- You can obtain the latest PG-Strom tree at github.com.
- Build the module with make. pg_config should be appeared in the $PATH. If CUDA was installed somewhere except for /usr/local/cuda, you should give the alternate installation path using CUDA_DIR
- Install the module with make install.
- Edit the $PGDATA/postgresql.conf
- PG-Strom must be loaded at shared_preload_library stage. So, '$libdir/pg_strom' shall be added to the shared_preload_library parameter.
- Restart PostgreSQL server.
- Restart the server using pg_ctl restart
- Create an extension of PG-Strom
- Run CREATE EXTENSION pg_strom to set up related foreign-data-wrapper, foreign-server, and functions.
Architecture
Characteristic of hardware components is an important factor to determine whole of system design. Even though GPU device has TFlops class capability of calculation, data being calculated has to be loaded to device memory; its capacity is usually smaller than host memory, and smaller bandwidth compared to the one between CPU and host-memory.

Thus, we need to exercise wisdom; how to copy data more fast between host and device. One idea is not to transfer data being unused to calculation. In case of a foreign-table of PG-Strom has 5 columns, but only 2 of them used to evaluation of qualifier, it is waste of bandwidth to copy contents of the rest of 3 columns being unused. In addition, we like to minimize number of disk-i/o because its latency and bandwidth are much slower. It requires us higher density of data on storage.

PG-Strom adopts column-oriented data structure. When a foreign-table managed by PG-Strom is defined, it implicitly creates shadow tables for each columns and table, under the pg_strom schema.
These shadow tables have rowid and a chunk of values represented as bytea value. Each tuples within shadow table contains nitem of elements within isnull bitmap and values array, PG-Strom can fetch a set of values from the database so fast. Note that a memcpy() enables to copy a chunk of values from database buffer to DMA buffer; being transferred to/from GPU device asynchronously. It is a great advantage that does not need to deform a tuple to fetch a particular field.

The above figure shows the mechanism. PG-Strom load data into chunk-buffer from shadow tables behind of the column referenced to the supplied qualifier, but nothing are loaded to used column. Then, the data loaded to chunk-buffer shall be copied to GPU device, it calculate the given data, and results shall be written-back. The series of steps shall be executed asynchronously, thus, CPU is available to load data onto the next chunk-buffer concurrently.
As a preferable side-effects, we can expect data compression and reduction of total amount of i/o, although it depends on characteristics of data set.

One other characteristic is run-time code generation of GPU. The query planner invokes the PlanForeignScan callback of PG-Strom. At that time, PG-Strom checks whether the supplied qualifier is executable on GPU, then, it generates a source code to implement a function equivalent to the qualifier, but it follows the manner of CUDA.
Then, the executor invokes executor callbacks of PG-Strom to scan the foreign-table. PG-Strom has three stages to scan underlying shadow tables. The first step is initialization; that initializes the GPU device and compile the automatically generated source code using nvcc compiler. It generates GPU executable binary. We tend to use same queries multiple times, so, it is worthwhile to cache the binary code. The second step is load; that load data being used in calculation from shadow tables to chunk-buffer, then kicks asynchronous memory copy and device execution. The chunk being kicked shall be calculated by GPU during CPU load the next chunk, then, CPU scan the shadow tables (if columns not referenced by qualifiers are required) according to the result of this calculation. All we need to fetch is tuples with particular row-id that is evaluated to be visible.

Read more at http://wiki.postgresql.org/wiki/PGStrom
Category: Software





