Compressing JSON: gzip vs zstd

JSON is the de facto standard for exchanging data on the Internet. It is a relatively simple text format inspired by JavaScript. I say “relatively simple” because you can read and understand the entire JSON specification in minutes.

Though JSON is a concise format, it is best sent over a slow network in compressed form. Without any effort, you can often compress JSON files by a factor of ten or more.

Compressing files adds overhead: it takes time to compress the file, and it takes time again to uncompress it. However, sending a file that is many times smaller over the network may be many times faster. The benefits of compression shrink as network bandwidth increases, and given the large gains we have experienced in the last decade, compression may matter less today: the bandwidth between nodes in a cloud setting (e.g., AWS) can be gigabytes per second. At such speeds, having fast decompression is important.

There are many compression formats. The conventional approach, supported by many web servers, is gzip. There are also more recent and faster alternatives. I pick one popular choice: zstd.

For my tests, I choose a JSON file that is representative of real-world JSON: twitter.json. It is an output from the Twitter API.

Generally, you should expect zstd to compress slightly better than gzip. My results are as follows, using standard Linux command-line tools with default settings:

uncompressed 617 KB
gzip (default) 51 KB
zstd (default) 48 KB
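
As a rough sketch (not the exact script I used), numbers like these can be obtained with the standard command-line tools at their default settings, assuming the file is named twitter.json:

    gzip -k twitter.json    # produces twitter.json.gz (-k keeps the original; requires a recent GNU gzip)
    zstd twitter.json       # produces twitter.json.zst; zstd keeps the input by default
    ls -l twitter.json*     # compare the file sizes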

To test decompression performance, I repeatedly uncompress the same file. Because it is a relatively small file, we should expect the disk accesses to be cached and fast.

Without any tweaking, I get twice the performance with zstd compared to the standard command-line gzip (which may differ from what your web server uses), while also getting better compression. It is a win-win. Modern compression algorithms like zstd can be really fast. For a fairer comparison, I have also included Eric Biggers’ libdeflate utility. It comes out ahead of zstd, which stresses once more the importance of using good software!

gzip 175 MB/s
gzip (Eric Biggers) 424 MB/s
zstd 360 MB/s
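
As an illustrative sketch (again, not the exact benchmark script), decompression speed can be estimated under bash by timing repeated runs over the cached files:

    # decompress to /dev/null repeatedly so that only the decompression work is measured
    time for i in $(seq 1 50); do gzip -d -c twitter.json.gz > /dev/null; done
    time for i in $(seq 1 50); do zstd -d -c twitter.json.zst > /dev/null; done
    # throughput is roughly (uncompressed size x 50) divided by the elapsed time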

My script is available. I run it under an Ubuntu system. If I create a RAM disk, the numbers go up slightly.
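
For reference, a RAM disk can be created with a tmpfs mount along these lines (the mount point and size are arbitrary):

    sudo mkdir -p /mnt/ramdisk
    sudo mount -t tmpfs -o size=256m tmpfs /mnt/ramdisk
    cp twitter.json /mnt/ramdisk/ && cd /mnt/ramdisk
    # ... rerun the compression and decompression commands here ...
    cd - && sudo umount /mnt/ramdisk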

I expect that I understate the benefits of fast compression routines:

    1. I use a Docker container. If you use containers, then disk and network accesses are slightly slower.
    2. I use the standard command-line tools. With tight integration of the compression libraries within your software, you can probably avoid many system calls and bypass the disk entirely.

Thus my numbers are somewhat pessimistic: in practice, you are limited even more by the computational overhead and by the choice of algorithm.

The lesson is that there can be large differences in decompression speed and that these differences matter. You ought to benchmark.

What about parsing the uncompressed JSON? We have demonstrated that you can often parse JSON at 3 GB/s or better. I expect that, in practice, you can make JSON parsing almost free compared to compression, disk, and network delays.

Update: This blog post was updated to include Eric Biggers’ libdeflate utility.

Note: There have been many requests to expand this blog post with various parameters and so forth. The purpose of the blog post was to illustrate that there are large performance differences, not to provide a survey of the best techniques. Identifying the best approach is simply out of scope for this post. I mean to encourage you to run your own benchmarks.

See also: Cloudflare has its own implementation of the algorithm behind gzip. They claim massive performance gains. I have not tested it.

Further reading: Parsing Gigabytes of JSON per Second, VLDB Journal 28 (6), 2019


9 thoughts on “Compressing JSON: gzip vs zstd”

  1. Blosc has demonstrated that, by using a really fast codec with a modest compression ratio, you can actually speed up processing relative to using uncompressed data, by relieving load on the bottleneck of main-memory fetching. This only helps with a large number of threads, though, not single-CPU performance.

    But if this can be true for DRAM, it can definitely be relevant to disk and network. So while .json.zstd may be good over the internet, I expect .json.lz4 to be beneficial almost always.

    I wonder how fast a tightly coupled lz4-json decoder with an intermediate buffer size optimized for L1 cache can get.

  2. gzip is a legacy format with a lot of design choices from a bygone era, so zstd has many inherent benefits, and it is not surprising that zstd dominates the Pareto frontier. However, there are several gz-format tools that vastly outperform gzip for compression and decompression. You note Cloudflare; however, if you are focused on decompression, it is worth looking at libdeflate or Intel’s igzip, which has extremely fast decompression on x86-64 machines. Relatedly, for web applications, Google’s Chrome includes a number of optimizations (some of which have found their way into Cloudflare).

  3. IMHO, the correct script link is:

    https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/tree/master/2021/06/30

    Related: the PostgreSQL hackers discussing ZSON improvements (May-June 2021):

    https://www.postgresql.org/message-id/flat/4aca1d4c-aa07-c168-bcca-236ec9f04c8d%40dunslane.net#ed814406717fc0915178261de7fd7e4a

    ZSON = “ZSON is a PostgreSQL extension for transparent JSONB compression. Compression is based on a shared dictionary of strings most frequently used in specific JSONB documents (not only keys, but also values, array elements, etc).” https://github.com/postgrespro/zson

  4. Awesome, I love a good compression discussion. Zstd blew my mind, and in fact I read that the zip file format just added support for zstd. So many open-source options have been blowing away legacy proprietary compression lately; it’s a great time to be alive.

    In addition to Facebook’s Zstd, Google gave us Brotli, which is tuned for compressing web content like HTML and JSON. Brotli is slower than Zstd, but it has a static dictionary based on the most common words or UTF-8 sequences on the internet, and it often compresses my JS/JSON 5-30% better than even Zstd. For example, here is an OpenAPI schema that’s 412 KB raw, 28-38 KB in Zstd, and 26 KB in Brotli; see the command sketch after the listing below. Zstd has blown Brotli out of the water in speed, but Brotli still compresses better for static compression of text like this.

    412K test.json
    26K test.json.br
    35K test.json.gz
    38K test.json.zst (default level)
    28K test.json.zst (max level 19)
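
    As a rough sketch, a comparison like this one can be reproduced with the command-line tools, assuming brotli, gzip, and zstd are installed (test.json stands in for the schema file):

        brotli test.json                          # produces test.json.br; brotli defaults to its highest quality
        gzip -k test.json                         # produces test.json.gz
        zstd test.json                            # default level (3), produces test.json.zst
        zstd -19 -o test.json.19.zst test.json    # level 19, matching the last line above
        ls -l test.json*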

  5. Dan, which application did you use for gzip compression? This is important because it impacts the results. You mentioned that web servers use gzip. Note that the application / library they use is actually zlib. However, the command line application that comes with many Linux and Unix-based OSes is GNU Gzip. I can’t tell from your script which of these you used, but I’m guessing it was GNU Gzip, which is not what web servers use.

    This is GNU Gzip: https://www.gnu.org/software/gzip/

    This is zlib: https://github.com/madler/zlib

    They’re totally different projects and codebases. The zlib library can generate three different formats: deflate, gzip, and the official zlib format. They all use deflate, but the headers are different. Web servers like Apache and nginx generate gzip files using the zlib library.

    Also, the compression levels you used are unclear. You didn’t specify any in your script, so it would be the defaults. There’s nothing special or authoritative about the defaults for benchmarking purposes, so it would be worth trying at least a few levels. I have no idea what the gzip default compression level is for either GNU Gzip or zlib, but for Zstd it’s level 3. That’s out of 22 possible levels, so it’s near the lowest ratio Zstd produces. The difference between your gzip and Zstd results would likely be much greater if you tried higher levels, since Zstd improves dramatically as you go up the levels, whereas gzip generally doesn’t improve much beyond level 6 (out of 9).

    Note that the current benchmarks for gzip compression are not GNU Gzip or zlib, which are old projects that emphasize compatibility with old computers. The benchmarks are libdeflate (by Eric Biggers) and zopfli (by Google, probably Jyrki Alakuijala and others). They both compress better than zlib, and libdeflate is also much faster (zopfli is super slow). The Cloudflare fork of zlib hasn’t been maintained, and it wasn’t actually usable or buildable last time I checked.

    https://github.com/ebiggers/libdeflate

    https://github.com/google/zopfli

      1. libdeflate offers more compression levels too. I think it’s 1 to 12, compared to 1 to 9 for typical gzip implementations. Those levels do in fact deliver better compression than legacy libraries like zlib – they’re not just there for granularity. e.g. libdeflate 12 compresses more than zlib 9 (and libdeflate 9 probably compresses more than zlib 9, and faster).
