(EN) GpuScan + SSD-to-GPU Direct DMA

An article for non-Japanese readers...

What I've been working on recently is a feature to load data blocks from NVMe-SSD to GPU using peer-to-peer DMA.
It bypasses CPU/RAM during the data loading process, and thus reduces the number of expensive data copies.
Once data blocks are loaded onto the GPU's RAM, we can process the individual records within the blocks with thousands of processor cores, according to a GPU binary generated from the SQL query.
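
The application-side workflow is conceptually simple: allocate GPU device memory as the DMA destination, then ask the kernel driver to pull the file blocks from the NVMe-SSD straight into it by peer-to-peer DMA. The sketch below shows what such an interface could look like; the device node, ioctl command and argument structure are hypothetical placeholders for illustration, not the actual NVMe-Strom API:

/* Hypothetical user-space sketch of SSD-to-GPU Direct DMA.
 * "/proc/nvme-strom", STROM_MEMCPY_SSD2GPU and struct strom_dma_arg
 * are illustrative placeholders, not the real driver interface. */
#include <cuda.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <unistd.h>

struct strom_dma_arg {
    CUdeviceptr dest_addr;      /* GPU device memory to receive the blocks */
    int         file_desc;      /* descriptor of a file on the NVMe-SSD */
    size_t      offset;         /* file offset of the first block */
    size_t      length;         /* number of bytes to transfer */
};
#define STROM_MEMCPY_SSD2GPU    _IOW('S', 1, struct strom_dma_arg)

static void ssd2gpu_load(CUdeviceptr dptr, int data_fd,
                         size_t offset, size_t length)
{
    int ioctl_fd = open("/proc/nvme-strom", O_RDONLY);  /* hypothetical node */
    struct strom_dma_arg arg = {
        .dest_addr = dptr,
        .file_desc = data_fd,
        .offset    = offset,
        .length    = length,
    };
    /* the driver pins the GPU pages for peer-to-peer access and programs
     * the NVMe device to DMA the blocks straight into GPU RAM, so the
     * payload never passes through CPU/RAM */
    ioctl(ioctl_fd, STROM_MEMCPY_SSD2GPU, &arg);
    close(ioctl_fd);
}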

PG-Strom is my long-standing project: an extension of PostgreSQL that transparently off-loads various SQL workloads to GPU devices. It has been under development for four years, and now supports simple scan, table join, aggregation and projection.
Its prime focus is CPU-intensive workloads; on the other hand, it has never touched the storage subsystem of PostgreSQL, because earlier versions of PG-Strom assumed the entire data set would be pre-loaded onto physical memory. Needless to say, this was a problem whenever we wanted to process a data set larger than physical RAM.

My solution is a feature enhancement to support data loading via SSD-to-GPU Direct DMA.
Consider a scan of a large table whose WHERE clause filters out 90% of the rows. Usually, the DBMS first loads the entire set of data blocks from storage to RAM, then evaluates individual records against the WHERE clause. This simple but massive calculation is exactly the kind of job a GPU can process far more efficiently, and it can eliminate 90% of the data before it is ever loaded onto CPU/RAM. It also keeps more valuable data cached, instead of having it evicted by junk data.
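
To make this concrete, a simple predicate such as "x BETWEEN y-2 AND y+2" (QUERY1 below) maps onto a trivially parallel CUDA kernel once the records sit in GPU RAM, one thread per record. The kernel below is a simplified sketch for illustration, not the code PG-Strom actually generates:

/* Simplified sketch of WHERE-clause evaluation on GPU: each thread
 * evaluates one record and appends the indexes of survivors to a
 * result buffer. */
__global__ void
gpuscan_qual(const float *x, const float *y, int nitems,
             int *results, unsigned int *nresults)
{
    int     i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < nitems && x[i] >= y[i] - 2.0f && x[i] <= y[i] + 2.0f)
    {
        /* only the ~10% of rows that match ever need to travel
         * back to CPU/RAM */
        unsigned int pos = atomicAdd(nresults, 1U);
        results[pos] = i;
    }
}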

Once we can load data blocks to the GPU before they reach the CPU, the existing features to run table joins and (partial) aggregation on the GPU come along for free. From the standpoint of the CPU, it looks as if the storage system has gained the intelligence to run part of the SQL workload, because row filtering, table joins and aggregation are already done by the time the data stream arrives at the CPU in response to its i/o requests.

A few days ago, the initial version of the SSD-to-GPU Direct DMA driver started working, so I measured its throughput when loading data blocks from a test file to GPU RAM. The result was remarkable.
The "NVMe-Strom" record shows SSD-to-GPU Direct DMA: its throughput loading data blocks onto the GPU device reaches about 2200MB/s, which is the catalog spec of the Intel SSD 750 (400GB).
It is also twice as fast as the usual pair of read(2) and cuMemcpyHtoDAsync, the CUDA API that kicks an asynchronous RAM-to-GPU DMA. With smaller i/o sizes, the cost of system-call invocation drives the throughput down even further.

Test environment:

  • CPU: Intel Xeon E5-2670 v3 (12C, 2.3GHz)
  • GPU: NVIDIA Tesla K20c (2496C, 706MHz)
  • RAM: 64GB
  • OS: CentOS 7 (kernel: 3.10.0-327.18.2.el7.x86_64)
  • FS: Ext4 (bs=4096)
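
For reference, the conventional read(2) + cuMemcpyHtoDAsync path compared above looks roughly like the following. This is a minimal sketch assuming a CUDA context is already current, with error handling omitted; note the two copies that SSD-to-GPU Direct DMA eliminates.

/* Baseline data path: SSD -> host RAM by read(2), then host RAM -> GPU RAM
 * by cuMemcpyHtoDAsync(). */
#include <cuda.h>
#include <fcntl.h>
#include <unistd.h>

static void load_block_baseline(const char *path, size_t buf_sz)
{
    CUdeviceptr dptr;
    void       *hbuf;
    CUstream    stream;
    int         fd;

    cuMemAlloc(&dptr, buf_sz);          /* destination buffer in GPU RAM */
    cuMemHostAlloc(&hbuf, buf_sz, 0);   /* pinned host buffer; required for
                                         * truly asynchronous DMA */
    cuStreamCreate(&stream, CU_STREAM_DEFAULT);

    fd = open(path, O_RDONLY);
    read(fd, hbuf, buf_sz);             /* copy #1: SSD -> host RAM */
    cuMemcpyHtoDAsync(dptr, hbuf, buf_sz, stream);
                                        /* copy #2: host RAM -> GPU RAM */
    cuStreamSynchronize(stream);

    close(fd);
    cuStreamDestroy(stream);
    cuMemFreeHost(hbuf);
    cuMemFree(dptr);
}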

In addition, I tried to measure the performance of a few SQL queries:

  1. A large table scan with simple WHERE clause
  2. A large table scan with complicated WHERE clause
  3. A large table scan with LIKE clause

The table size is about 64GB, containing 700 million rows. The result is quite interesting.


[Benchmark chart: execution time of the three queries; "PG-Strom" in red, "PG-Strom + NVMe-Strom" in blue] *1

The "PG-Strom" result (red) shows the performance of GPU acceleration on top of the existing storage system of PostgreSQL. It implies the throughput is dominated by i/o: the table scan cannot complete in less than 140 sec.
This is still valuable when the WHERE clause contains complicated expressions, but offers less advantage in the other cases.

On the other hand, "PG-Strom + NVMe-Strom" (blue) breaks through this limitation. In the case of the simple query, it scans the 64GB table in 42.67 sec, a processing throughput of 1535MB/s.
We can expect PG-Strom to be even more valuable for CPU-intensive workloads such as JOIN or GROUP BY than for simple table scans. I expect the pair of PG-Strom and NVMe-Strom to accelerate these kinds of workloads beyond the existing i/o boundary.
I'm really looking forward to benchmarking workloads that contain JOIN and GROUP BY once I add SSD-to-GPU Direct DMA support to those features!

QUERY for table construction)

CREATE TABLE t_64g (id int not null,
                    x float not null,
                    y float not null,
                    z float not null,
                    memo text);
INSERT INTO t_64g (SELECT x, random()*1000, random()*1000,
                          random()*1000, md5(x::text)
                     FROM generate_series(1,700000000) x);

postgres=# \d+
                         List of relations
 Schema |       Name       | Type  | Owner  |  Size   | Description
--------+------------------+-------+--------+---------+-------------
 public | t                | table | kaigai | 965 MB  |
 public | t_64g            | table | kaigai | 66 GB   |

QUERY1)
SELECT * FROM t_64g WHERE x BETWEEN y-2 AND y+2;

QUERY2)
SELECT * FROM t_64g WHERE sqrt((x-200)^2 + (y-300)^2 + (z-400)^2) < 10;

QUERY3)
SELECT * FROM t_64g WHERE memo LIKE '%abcd%';

*1: Performance was measured again because asynchronous DMA mode was not enabled in the previous result, so the synchronous copy blocked unrelated tasks.