nvcomp is a CUDA library that features generic compression interfaces to enable developers to use high-performance GPU compressors in their applications.
nvcomp 1.0 includes Cascaded and LZ4 compression methods. Cascaded compression demonstrates high performance, with up to 500 GB/s throughput and compression ratios of up to 80x on numerical data from analytical workloads. LZ4 features up to 60 GB/s decompression throughput and good compression ratios for arbitrary byte streams.
Below are performance vs. compression ratio scatter plots for a few numerical columns from the TPC-H and Fannie Mae Mortgage datasets for Cascaded compression (R1D1B1), and for string columns from TPC-H for LZ4 compression. For the TPC-H dataset we used the SF10 lineitem table with the following columns: column 0 (L_ORDERKEY) as 8B integers, and columns 8, 9, 13, 14, 15 as byte streams. From the Mortgage dataset we used the 2009 Q2 performance table: column 0 (LOAN_ID) as 8B integers and column 10 (CURRENT_LOAN_DELINQUENCY_STATUS) as 4B integers. The numbers were collected on a Tesla V100 PCIe card (with ECC on). Note that you can tune the Cascaded format settings (e.g., the number of RLE layers) for an even better compression ratio on some of these datasets.
The library is designed to be modular, with the ability to add new implementations without changing the high-level user interface. We’re working on additional schemes, as well as a “how to” guide for developers to add their own custom algorithms. Stay tuned for updates!
Below you can find instructions on how to build the library, reproduce our benchmarking results, a guide on how to integrate nvcomp into your application, and a detailed description of the compression methods. Enjoy!
The initial public release of nvcomp. Full details are in CHANGELOG.md.
Known issues:
- Cascaded compression requires a large amount of temporary workspace to operate. The current workaround is to compress/decompress large datasets in pieces, re-using the temporary workspace for each piece.
- While the LZ4 decompression kernels are well-optimized, the compression implementation has not yet been tuned for performance (currently a work in progress).
nvcomp uses CMake for building. Generally, it is best to do an out-of-source build:
mkdir build/
cd build
cmake ..
make
If you're building with CUDA 10 or earlier, you will need to specify a path to CUB on your system:
cmake -DCUB_DIR=<path to cub repository>
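For example, a full out-of-source configure that points CMake at a CUB checkout might look like the following (the CUB path here is a placeholder for wherever the CUB repository lives on your system):
mkdir build/
cd build
# pass the CUB location in the same configure step as the normal build
cmake -DCUB_DIR=/path/to/cub ..
make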
GPU requirement:
- It is recommended to run on the Volta architecture or newer (the library was not tested on Pascal and older GPUs, but may work)
To obtain TPC-H data:
- Clone and compile https://github.com/electrum/tpch-dbgen
- Run ./dbgen -s <scale factor>, then grab lineitem.tbl
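For example, to generate the scale factor 10 (SF10) data used in the reference results below:
# dbgen writes lineitem.tbl (and the other TPC-H tables) into the current directory
./dbgen -s 10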
To obtain Mortgage data:
- Download 16 Years archive from https://rapidsai.github.io/demos/datasets/mortgage-data
- Unpack and copy mortgage/perf/Performance_2009Q2.txt
Convert CSV files to binary files:
- benchmarks/text_to_binary.py is provided to read a .csv or text file and output a chosen column of data into a binary file
- For example, run
python benchmarks/text_to_binary.py lineitem.tbl <column number> <datatype> column_data.bin '|'
to generate the binary dataset column_data.bin for TPC-H lineitem column <column number> using <datatype> as the type
- Note: make sure that the delimiter is set correctly; the default is ','
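As a concrete illustration, the commands below produce the two lineitem inputs used in the reference results further down. The datatype keywords long and string are assumptions inferred from the -t flag and the file names in those results; check text_to_binary.py for the exact spellings it accepts.
# column 0 (L_ORDERKEY) as 8-byte integers
python benchmarks/text_to_binary.py lineitem.tbl 0 long lineitem-col0-long.bin '|'
# column 15 (L_COMMENT) as a raw byte stream
python benchmarks/text_to_binary.py lineitem.tbl 15 string lineitem-col15-string.bin '|'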
Run tests:
- Run ./bin/benchmark_cascaded or ./bin/benchmark_lz4 with -f BIN:column_data.bin <options> to measure throughput.
Below are some reference benchmark results on a Tesla V100 for TPC-H.
Example of compressing and decompressing TPC-H SF10 lineitem column 0 (L_ORDERKEY) using Cascaded with RLE + Delta + Bit-packing (input stream is treated as 8-byte integers):
$ ./bin/benchmark_cascaded -f BIN:/tmp/lineitem-col0-long.bin -t long -r 1 -d 1 -b 1 -m
----------
uncompressed (B): 479888416
compression memory (input+output+temp) (B): 2400694079
compression temp space (B): 1200972879
compression output space (B): 719832784
comp_size: 15000160, compressed ratio: 31.99
compression throughput (GB/s): 184.06
decompression memory (input+output+temp) (B): 930155136
decompression temp space (B): 435266560
decompression throughput (GB/s): 228.53
Example of compressing and decompressing TPC-H SF10 lineitem column 15 (L_COMMENT) using LZ4:
$ ./bin/benchmark_lz4 -f BIN:lineitem-col15-string.bin -m
----------
uncompressed (B): 2579400236
compression memory (input+output+temp) (B): 5303691651
compression temp space (B): 134775815
compression output space (B): 2589515600
comp_size: 1033539977, compressed ratio: 2.50
compression throughput (GB/s): 1.53
decompression memory (input+output+temp) (B): 3612940213
decompression temp space (B): 0
decompression throughput (GB/s): 26.16
Note: Your TPC-H performance results may not precisely match the output above, since the TPC-H data generator produces random data; you would need to use the same seed to produce byte-matching input files.