What is nvcomp?

nvcomp is a CUDA library that features generic compression interfaces to enable developers to use high-performance GPU compressors in their applications.

nvcomp 1.0 includes Cascaded and LZ4 compression methods. The Cascaded methods demonstrate high performance, with up to 500 GB/s throughput and compression ratios of up to 80x on numerical data from analytical workloads. The LZ4 methods feature up to 60 GB/s decompression throughput and good compression ratios on arbitrary byte streams.

Below are performance vs. compression ratio scatter plots for a few numerical columns from the TPC-H and Fannie Mae Mortgage datasets for Cascaded compression (R1D1B1), and for string columns from TPC-H for LZ4 compression. For the TPC-H dataset we used the SF10 lineitem table with the following columns: column 0 (L_ORDERKEY) as 8B integers, and columns 8, 9, 13, 14, 15 as byte streams. From the Mortgage dataset we used the 2009 Q2 performance table: column 0 (LOAN_ID) as 8B integers and column 10 (CURRENT_LOAN_DELINQUENCY_STATUS) as 4B integers. The numbers were collected on a Tesla V100 PCIe card (with ECC on). Note that you can tune the Cascaded format settings (e.g. the number of RLE layers) for an even better compression ratio on some of these datasets.

[Figure: Cascaded compression performance vs. compression ratio]

[Figure: LZ4 performance vs. compression ratio]

The library is designed to be modular, with the ability to add new implementations without changing the high-level user interface. We’re working on additional schemes, as well as a “how to” guide for developers who want to add their own custom algorithms. Stay tuned for updates!

Below you can find instructions on how to build the library and reproduce our benchmarking results, a guide to integrating nvcomp into your application, and a detailed description of the compression methods. Enjoy!

Version 1.0 Release

The initial public release of nvcomp. Full details in CHANGELOG.md.

Known issues

  • Cascaded compression requires a large amount of temporary workspace to operate. The current workaround is to compress/decompress large datasets in pieces, reusing the temporary workspace for each piece (see the sketch below).

  • While the LZ4 decompression kernels are well optimized, the compression implementation has not yet been tuned for performance (work in progress).
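
For illustration, here is a compact sketch of that piecewise workaround: loop over fixed-size pieces, query the temporary-space requirement for each piece, and grow a single reusable temp buffer as needed. It is a minimal sketch assuming the v1.0 C++ interface (nvcomp::CascadedCompressor, shown under "How to use the library in your code" below); the header name and exact signatures are assumptions here, so verify them against the shipped headers.

// Sketch of the chunked-compression workaround. The class name, header,
// and method signatures are assumed from the v1.0 C++ interface; verify
// against the actual headers before use.
#include "cascaded.hpp"   // assumed header for the Cascaded C++ interface
#include <cuda_runtime.h>
#include <algorithm>

void compress_in_pieces(const long* d_input, size_t total_count,
                        size_t piece_count, cudaStream_t stream) {
  void* d_temp = nullptr;
  size_t temp_capacity = 0;
  for (size_t off = 0; off < total_count; off += piece_count) {
    const size_t n = std::min(piece_count, total_count - off);
    // One RLE layer, one delta layer, bit-packing enabled (R1D1B1).
    nvcomp::CascadedCompressor<long> compressor(d_input + off, n, 1, 1, true);

    const size_t temp_size = compressor.get_temp_size();
    if (temp_size > temp_capacity) {   // grow the shared temp buffer
      cudaFree(d_temp);
      cudaMalloc(&d_temp, temp_size);
      temp_capacity = temp_size;
    }

    size_t out_size = compressor.get_max_output_size(d_temp, temp_capacity);
    void* d_out;
    cudaMalloc(&d_out, out_size);
    // out_size is updated to the actual compressed size on return
    compressor.compress_async(d_temp, temp_capacity, d_out, &out_size, stream);
    cudaStreamSynchronize(stream);
    // ... record (d_out, out_size) for this piece, or copy it out and free
    cudaFree(d_out);
  }
  cudaFree(d_temp);
}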

Building the library

nvcomp uses CMake for building. Generally, it is best to do an out-of-source build:

mkdir build/
cd build
cmake ..
make

If you're building with CUDA 10 or earlier, you will also need to point CMake at a copy of CUB on your system:

cmake -DCUB_DIR=<path to cub repository> ..

Running benchmarks

GPU requirement:

  • It's recommended to run on the Volta architecture or newer (the library has not been tested on Pascal and older, but may work)

To obtain TPC-H data:

  • Generate the tables (e.g. the SF10 lineitem.tbl used below) with the dbgen tool from the official TPC-H benchmark kit

To obtain Mortgage data:

  • Download Fannie Mae's Single-Family Loan Performance data (the benchmarks below use the 2009 Q2 performance table)

Convert CSV files to binary files:

  • benchmarks/text_to_binary.py is provided to read a .csv or text file and write a chosen column of data to a binary file
  • For example, run python benchmarks/text_to_binary.py lineitem.tbl <column number> <datatype> column_data.bin '|' to generate the binary dataset column_data.bin from TPC-H lineitem column <column number>, using <datatype> as the type
  • Note: make sure the delimiter is set correctly; the default is ','

Run the benchmarks:

  • Run ./bin/benchmark_cascaded or ./bin/benchmark_lz4 with -f BIN:column_data.bin <options> to measure throughput.

Below are some reference benchmark results on a Tesla V100 for TPC-H.

Example of compressing and decompressing TPC-H SF10 lineitem column 0 (L_ORDERKEY) using Cascaded with RLE + Delta + Bit-packing; the -r 1 -d 1 -b 1 flags below select one RLE layer, one delta layer, and bit-packing, and -t long treats the input stream as 8-byte integers:

$ ./bin/benchmark_cascaded -f BIN:/tmp/lineitem-col0-long.bin -t long -r 1 -d 1 -b 1 -m
----------
uncompressed (B): 479888416
compression memory (input+output+temp) (B): 2400694079
compression temp space (B): 1200972879
compression output space (B): 719832784
comp_size: 15000160, compressed ratio: 31.99
compression throughput (GB/s): 184.06
decompression memory (input+output+temp) (B): 930155136
decompression temp space (B): 435266560
decompression throughput (GB/s): 228.53

Example of compressing and decompressing TPC-H SF10 lineitem column 15 (L_COMMENT) using LZ4:

$ ./bin/benchmark_lz4 -f BIN:lineitem-col15-string.bin -m
----------
uncompressed (B): 2579400236
compression memory (input+output+temp) (B): 5303691651
compression temp space (B): 134775815
compression output space (B): 2589515600
comp_size: 1033539977, compressed ratio: 2.50
compression throughput (GB/s): 1.53
decompression memory (input+output+temp) (B): 3612940213
decompression temp space (B): 0
decompression throughput (GB/s): 26.16

Note: your TPC-H results may not precisely match the output above, since the TPC-H data generator produces random data; you need to use the same seed to get byte-identical input files.

How to use the library in your code
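
As a rough outline of what a minimal integration looks like, the sketch below compresses a device buffer of integers with the Cascaded scheme and then decompresses it again. It assumes the v1.0 C++ interface (nvcomp::CascadedCompressor and nvcomp::Decompressor); the header names and exact signatures are assumptions, so check the headers that ship with the library.

// Minimal round trip with the Cascaded scheme. Class names and signatures
// are assumed from the v1.0 C++ interface; verify against the headers.
#include "nvcomp.hpp"     // assumed header names for the C++ interface
#include "cascaded.hpp"
#include <cuda_runtime.h>

void round_trip(const int* d_in, size_t count, cudaStream_t stream) {
  // --- Compress: one RLE layer, one delta layer, bit-packing (R1D1B1) ---
  nvcomp::CascadedCompressor<int> compressor(d_in, count, 1, 1, true);

  void* d_comp_temp;
  const size_t comp_temp_size = compressor.get_temp_size();
  cudaMalloc(&d_comp_temp, comp_temp_size);

  size_t comp_out_size = compressor.get_max_output_size(d_comp_temp, comp_temp_size);
  void* d_comp_out;
  cudaMalloc(&d_comp_out, comp_out_size);

  // comp_out_size is updated to the actual compressed size on return
  compressor.compress_async(d_comp_temp, comp_temp_size,
                            d_comp_out, &comp_out_size, stream);
  cudaStreamSynchronize(stream);

  // --- Decompress ---
  nvcomp::Decompressor<int> decompressor(d_comp_out, comp_out_size, stream);

  void* d_decomp_temp;
  const size_t decomp_temp_size = decompressor.get_temp_size();
  cudaMalloc(&d_decomp_temp, decomp_temp_size);

  const size_t output_count = decompressor.get_num_elements();
  int* d_decomp_out;
  cudaMalloc((void**)&d_decomp_out, output_count * sizeof(int));

  decompressor.decompress_async(d_decomp_temp, decomp_temp_size,
                                d_decomp_out, output_count, stream);
  cudaStreamSynchronize(stream);

  // d_decomp_out now holds the original `count` integers
  cudaFree(d_comp_temp);  cudaFree(d_comp_out);
  cudaFree(d_decomp_temp); cudaFree(d_decomp_out);
}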

Further information about our compression algorithms
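
For intuition, here is a toy walk-through of a Cascaded pipeline with one RLE layer, one delta layer, and bit-packing (the R1D1B1 configuration used in the benchmarks above). The actual on-GPU data layout differs; this only shows why repetitive, slowly-varying numeric columns compress so well:

input (8B integers):  4 4 4 4 7 7 9 9 9
after RLE:            values = 4 7 9, run lengths = 4 2 3
after delta (values): 4 3 2   (each value minus its predecessor)
after bit-packing:    every remaining number fits in 3 bits instead of 64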
