A billion rows processed per second on a single-node PostgreSQL

For the last couple of months I have been working on benchmarking PG-Strom on a larger hardware configuration.
Because of the server models we had on hand, our benchmark results had usually been measured on a small 1U rack server with one CPU, one GPU and 3-4 NVME-SSDs, even though PG-Strom supports multiple GPUs and striping of NVME-SSDs.
So, a few months ago we decided to arrange a more powerful hardware environment to validate the maximum performance of PG-Strom. With it, we could confirm more than a billion rows processed per second on a single-node PostgreSQL system using the Star Schema Benchmark (SSBM), as we had expected.
This article briefly introduces the underlying technology, the benchmark results and the expected applications.

Hardware configuration for the benchmark

The diagram below shows the hardware configuration we prepared for the benchmark project.

The 4U rack server (Supermicro SYS-4029GP-TRT) is a certified model for NVIDIA Tesla V100 GPUs *1, and is optimized for NVIDIA GPUDirect RDMA by means of PCIe switches.
The PCIe switches serve two purposes here. They allow more devices to be installed than the host CPU's PCIe lanes would otherwise permit *2, and they allow peer-to-peer data transfers between PCIe devices under the same switch to bypass the CPU.
It is an ideal configuration for us, because PG-Strom fully utilizes P2P data transfer in its SSD-to-GPU Direct SQL mode, and that is the key to its performance.
The U.2 NVME-SSD drives (Intel DC P4510) are installed in external JBOF enclosures and connected to the host system using PCIe host cards and direct-attach cables *3. Each host card is designed for a PCIe x16 slot and can connect four NVME-SSD drives, each with PCIe x4 bandwidth. So, the data transfer bandwidth over the PCIe bus balances between 1x GPU and 4x SSDs.
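As a back-of-envelope check of that balance (assuming PCIe Gen3 payload of roughly 0.985GB/s per lane):

    \underbrace{4 \times 3.9\ \text{GB/s}}_{\text{4x NVME-SSD, x4 each}} \;\approx\; \underbrace{15.8\ \text{GB/s}}_{\text{1x GPU, x16}}
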
Below is the block diagram of our benchmark environment.


Element technology① - SSD-to-GPU Direct SQL

SSD-to-GPU Direct SQL is a characteristic feature of PG-Strom. In cooperation with a special Linux kernel driver (nvme_strom) that mediates peer-to-peer DMA from NVME-SSD to GPU over the PCIe bus, PG-Strom can load PostgreSQL data blocks on NVME-SSD drives directly onto the GPU's device memory, bypassing the CPU and host RAM.

Usually, software cannot determine which items are necessary and which ones are junk before the data blocks have been loaded into system RAM. In other words, we spend I/O bandwidth and CPU cycles on copying junk data.
SSD-to-GPU Direct SQL changes this data flow. The GPU pre-processes the SQL workload in the middle of the I/O path, eliminating unnecessary rows and running the JOIN/GROUP BY steps, so the CPU/RAM eventually receives only a small portion of the result set from the GPU.
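As a side note, the GPU path is controlled at the SQL level. A minimal sketch, assuming the pg_strom module is already loaded via shared_preload_libraries as its documentation describes:

-- register the extension once per database
CREATE EXTENSION IF NOT EXISTS pg_strom;

-- pg_strom.enabled switches GPU offloading on/off per session, which is
-- handy for comparing GPU and CPU-only runs of the same query
SET pg_strom.enabled = on;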

Element technology② - Arrow_Fdw

Apache Arrow is a column-oriented data format for structured data-sets, and many applications (especially in the big-data and data-analytics area) support it for data exchange. PG-Strom can read Apache Arrow files as its columnar store, in addition to PostgreSQL's row data blocks.

Due to the nature of the data format, columnar data allows reading only the referenced columns, unlike row data. This usually reduces the amount of I/O to be read from storage.
Arrow_Fdw is designed to read Apache Arrow files directly, without importing the data into the database system. This means we can eliminate one time-consuming step if the upstream software generates its input data in the Apache Arrow format. This is a suitable characteristic for IoT/M2M log-processing systems, which usually generate tons of data to be fed into the data-processing phase.
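For illustration, a minimal sketch of mapping an Apache Arrow file as a foreign table; the column list and file path are hypothetical, and the exact Arrow_Fdw options should be checked against the PG-Strom documentation:

CREATE FOREIGN TABLE sensor_log_arrow (
    ts           timestamp,
    device_id    int,
    temperature  float
) SERVER arrow_fdw
  OPTIONS (file '/path/to/sensor_log.arrow');

-- the file is queried in place; no COPY / INSERT loading step is required
SELECT device_id, avg(temperature)
  FROM sensor_log_arrow
 GROUP BY device_id;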

Star Schema Benchmark and PostgreSQL partition

As usual, we used the Star Schema Benchmark (SSBM) for the performance measurement.
The scaling factor of lineorder is SF=4000, and the table is distributed across four partition leafs using PostgreSQL's hash partitioning.
So, each partition leaf holds about 879GB (6 billion rows) of row data, and the total size of lineorder is about 3.5TB.
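A sketch of that layout in PostgreSQL v11 syntax (the hash key shown here, lo_orderkey, is an assumption, and the column list is abbreviated):

CREATE TABLE lineorder (
    lo_orderkey   bigint,
    lo_orderdate  int,
    lo_partkey    int,
    lo_suppkey    int,
    lo_revenue    bigint
    -- ... remaining SSBM columns ...
) PARTITION BY HASH (lo_orderkey);

-- four leafs, each holding roughly a quarter of the 24 billion rows
CREATE TABLE lineorder_p0 PARTITION OF lineorder FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE lineorder_p1 PARTITION OF lineorder FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE lineorder_p2 PARTITION OF lineorder FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE lineorder_p3 PARTITION OF lineorder FOR VALUES WITH (MODULUS 4, REMAINDER 3);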

We also set up equivalent Apache Arrow files for each partition leaf, and built another partitioned table that consists of foreign tables on Arrow_Fdw.
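PostgreSQL allows foreign tables as partition leafs, so the Arrow_Fdw side can mirror the same layout; a sketch with hypothetical file paths:

-- same columns and hash key as lineorder
CREATE TABLE flineorder (LIKE lineorder) PARTITION BY HASH (lo_orderkey);

CREATE FOREIGN TABLE flineorder_p0 PARTITION OF flineorder
    FOR VALUES WITH (MODULUS 4, REMAINDER 0)
    SERVER arrow_fdw OPTIONS (file '/nvme/0/flineorder_p0.arrow');
-- ... and likewise for the other three leafs ...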

SSBM defines 13 queries; each contains a scan on lineorder, which is the largest portion of the workload, followed by JOIN / GROUP BY steps. For example, the query below is Q2_3.

select sum(lo_revenue), d_year, p_brand1
  from lineorder, date1, part, supplier
  where lo_orderdate = d_datekey
    and lo_partkey = p_partkey
    and lo_suppkey = s_suppkey
    and p_brand1 = 'MFGR#2221'
    and s_region = 'EUROPE'
  group by d_year, p_brand1
  order by d_year, p_brand1;
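A quick way to see how PG-Strom handles such a query is to look at the plan; when the GPU path is taken, PG-Strom's custom scan nodes (GpuScan, GpuJoin, GpuPreAgg) show up in place of the usual Scan / Join / Aggregate nodes:

explain (costs off)
select sum(lo_revenue), d_year, p_brand1
  from lineorder, date1, part, supplier
  where lo_orderdate = d_datekey
    and lo_partkey = p_partkey
    and lo_suppkey = s_suppkey
    and p_brand1 = 'MFGR#2221'
    and s_region = 'EUROPE'
  group by d_year, p_brand1
  order by d_year, p_brand1;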

Benchmark Results

Below are the benchmark results. The vertical axis shows the number of rows processed per second.
It is defined as (number of rows in lineorder; 24 billion) / (SQL response time).
For example, PG-Strom + Arrow_Fdw returns the result of Q1_2 within 14.0s, which means 1.71 billion rows were processed per second.

The blue bars show PostgreSQL v11.5 with manual tuning to launch 24 parallel workers *4. Its performance was about 50-60 million rows per second.

The orange bars show PG-Strom on PostgreSQL row data with SSD-to-GPU Direct SQL. Its performance was about 250 million rows per second.

These two series show little variation over the 13 queries, because I/O dominates the overall workload rather than JOIN / GROUP BY; an equivalent amount of storage I/O therefore leads to almost equivalent performance.
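A rough cross-check, using the row-data figures above:

    \frac{24 \times 10^9\ \text{rows}}{0.25 \times 10^9\ \text{rows/s}} \approx 96\ \text{s}, \qquad \frac{3.5\ \text{TB}}{96\ \text{s}} \approx 36\ \text{GB/s}

That is close to the roughly 40GB/s the sixteen drives sustain during execution (see the iostat chart below), which is consistent with the row-data runs being bound by storage I/O.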

The green bars show PG-Strom + Arrow_Fdw with SSD-to-GPU Direct SQL, which delivers the top performance.
Due to the nature of columnar data, query performance varies according to the number of referenced columns.
For the Q1_* and Q2_* groups, we could validate performance of more than a billion rows processed per second.

The graph above shows the time-series result of iostat.
It shows that PG-Strom could load data from the 16 NVME-SSD drives at 40GB/s during query execution.
There are 13 peaks because SSBM defines 13 queries.

Conclusion

These benchmark results show that, with a proper configuration of hardware and software, reporting and analytics queries on terabyte-grade data-sets can be processed at more than a billion rows per second.
This grade of performance will change our assumptions about the system architecture needed to process "big data".
People have usually had no choice but to deploy a cluster system; now, however, GPU- and NVME-accelerated PostgreSQL can be an alternative for them.

We think our primary target is log-data processing in the IoT/M2M area, where many devices generate data day by day. The raw data is usually too large for visualization or machine learning, so it needs to be summarized before those more valuable tasks.
A single-node configuration makes system administration much simpler than a cluster-based solution, and PostgreSQL is long-standing software that many engineers are already familiar with.

Appendix - hardware components

This parts list is just for your reference.

Parts    Model                                 Qty
Server   Supermicro SYS-4029GP-TRT               1
CPU      Intel Xeon Gold 6226 (12C, 2.7GHz)      2
RAM      16GB DIMM (DDR4-2933; ECC)             12
GPU      NVIDIA Tesla V100 (PCI-E; 16GB)         2
GPU      NVIDIA Tesla V100 (PCI-E; 32GB)         2
HDD      Seagate 2.5inch 1.0TB/2.0TB             4
JBOF     SerialCables PCI-ENC8G-08A              2
SSD      Intel DC P4510 (U.2; 1.0TB)            16
HBA      SerialCables PCI-AD-x16HE-M             4

*1: This model can install up to 8 GPUs.

*2: Xeon Gold 6226 [Cascade Lake] has 48 PCIe lanes.

*3: Supermicro does not support replacing the internal storage backplane with one that supports NVME.

*4: Only 10 workers were launched by default, which means 2-3 workers per partition leaf. That is too few to pull out the full I/O capability, because the CPU usage ratio was 100% during execution.
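For reference, a minimal sketch of the standard parallel-query knobs involved in that tuning; the exact values below are assumptions, only the 24-worker total matters:

-- in postgresql.conf (restart required): enlarge the background worker pool,
-- e.g. max_worker_processes = 48

-- per session: allow a single query to use up to 24 parallel workers
SET max_parallel_workers = 24;
SET max_parallel_workers_per_gather = 24;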