Benchmark: zlib-ng vs isa-l, zlib, libdeflate, brotli #1486

Open
powturbo opened this issue May 10, 2023 · 15 comments

@powturbo

powturbo commented May 10, 2023

TurboBench: Build or download the executables and test with your own data.

Benchmark 1:
TurboBench: Dynamic/Static web content compression benchmark

Benchmark 2:
`turbobench silesia.tar -eigzip,0,1,2,3/zlib_ng,1,3,6,9/libdeflate,1,3,6,9,12/zlib,1,3,6,9/memcpy`
Hardware: Lenovo IdeaPad 5 Pro, Ryzen 6600HS. (bold = Pareto-optimal; MB = 1,000,000 bytes)

| Compressed size (bytes) | Ratio % | Compression MB/s | Decompression MB/s | Name |
|---|---|---|---|---|
| 64677910 | 30.5 | 7.47 | 1133.66 | libdeflate 12 |
| 66715898 | 31.5 | 43.04 | 1116.39 | libdeflate 9 |
| 67511452 | 31.9 | 119.35 | 1127.36 | libdeflate 6 |
| 67644075 | 31.9 | 15.55 | 483.82 | zlib 9 |
| 68152563 | 32.2 | 27.79 | 734.94 | zlib_ng 9 |
| 68228660 | 32.2 | 37.74 | 478.69 | zlib 6 |
| 68914854 | 32.5 | 92.33 | 735.24 | zlib_ng 6 |
| 70166917 | 33.1 | 185.35 | 1110.18 | libdeflate 3 |
| 71068342 | 33.5 | 203.57 | 1085.40 | libdeflate 2 |
| 72490921 | 34.2 | 138.61 | 694.45 | zlib_ng 3 |
| 72968832 | 34.4 | 86.50 | 480.09 | zlib 3 |
| 73505577 | 34.7 | 288.56 | 1075.72 | libdeflate 1 |
| 75138353 | 35.5 | 271.18 | 1080.75 | igzip 3 |
| 76571415 | 36.1 | 598.69 | 1047.07 | igzip 2 |
| 77260023 | 36.5 | 127.47 | 448.95 | zlib 1 |
| 78154519 | 36.9 | 615.11 | 1020.09 | igzip 1 |
| 87551010 | 41.3 | 638.49 | 969.43 | igzip 0 |
| 100929713 | 47.6 | 329.63 | 651.73 | zlib_ng 1 |
| 211948544 | 100.0 | 16146.00 | 16117.76 | memcpy |
@KungFuJesus
Contributor

I assume memcpy is just raw memory bandwidth with no compression? Both libdeflate and igzip have the advantage/disadvantage of not being forks of zlib but ground-up implementations (on the flip side, with incompatible APIs). Not to say there's no room for improvement in zlib-ng, but that is at least a footnote worth providing here. It looks like everybody's compression speed is a bit anemic. Naturally we'd expect compression to be slower than decompression, I guess, but I do wonder what's left on the table there without sacrificing compression ratio.

@nmoinvaz
Member

nmoinvaz commented May 10, 2023

AFAIK libdeflate requires the whole buffer up front and does not support streaming. So if all you look at is speed, then libdeflate will always win.

At the higher levels zlib-ng compresses roughly twice as fast as zlib, at the cost of a slightly larger compressed size, which is expected.
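
To make the difference concrete, here's a minimal sketch of the two usage models (error handling and sizing are simplified; the helper names are illustrative, but the library calls are the real public APIs of each project): libdeflate needs the whole input in memory and compresses it in one call, while zlib/zlib-ng stream through a `z_stream` in chunks with a fixed, small amount of state.

```c
/* Sketch only: one-shot (libdeflate) vs. streaming (zlib / zlib-ng compat API). */
#include <stdio.h>
#include <stdlib.h>
#include <libdeflate.h>
#include <zlib.h>

/* libdeflate: the whole input must already be in memory; one call compresses it. */
static size_t compress_whole_buffer(const void *in, size_t in_len, void **out) {
    struct libdeflate_compressor *c = libdeflate_alloc_compressor(6);
    size_t bound = libdeflate_deflate_compress_bound(c, in_len);
    *out = malloc(bound);
    size_t out_len = libdeflate_deflate_compress(c, in, in_len, *out, bound);
    libdeflate_free_compressor(c);
    return out_len;   /* 0 would mean the output buffer was too small */
}

/* zlib / zlib-ng (compat API): data is fed in chunks, so arbitrarily large
 * inputs only ever need a small fixed amount of state (the 32 KiB window). */
static int compress_streaming(FILE *src, FILE *dst) {
    unsigned char in[16384], out[16384];
    z_stream strm = {0};
    int flush;

    if (deflateInit(&strm, 6) != Z_OK)
        return -1;
    do {
        strm.avail_in = (uInt)fread(in, 1, sizeof(in), src);
        flush = feof(src) ? Z_FINISH : Z_NO_FLUSH;
        strm.next_in = in;
        do {
            strm.avail_out = sizeof(out);
            strm.next_out = out;
            deflate(&strm, flush);
            fwrite(out, 1, sizeof(out) - strm.avail_out, dst);
        } while (strm.avail_out == 0);
    } while (flush != Z_FINISH);
    deflateEnd(&strm);
    return 0;
}
```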

@KungFuJesus
Contributor

KungFuJesus commented May 10, 2023

Somewhat interesting: at level 6 the decompression speed is higher than at level 3, but perhaps that speaks to an inflate that spends more time in inflate_fast rather than decoding literals? I'm just guessing; I don't have profiles to really evaluate that delta today. I've definitely wanted something that chews through literals in the main inflate loop faster for a while now.

Err, well, all the implementations share that trait. I guess it really is just not having to rely on memory read bandwidth as much with those higher compression ratios.

@powturbo
Author

I've added a web content benchmark now.
The average web page is 84 KB, so streaming is not relevant here.

@powturbo changed the title from "Benchmark: zlib-ng vs isa-l, zlib, libdeflate" to "Benchmark: zlib-ng vs isa-l, zlib, libdeflate, brotli" on May 10, 2023
@KungFuJesus
Contributor

Heh, the sorting by size rather than throughput is throwing me off a bit. It looks like we don't do too much worse than libdeflate at level 9 in terms of compression throughput, albeit with some trade-off in compression ratio.

Better compression algorithms, of course, do better. But I'm not losing sleep over that; those things aren't in zlib-ng's purview. It might be worth a weekend dive into the techniques libdeflate benefits from for decompression.

@nmoinvaz
Member

Why libdeflate is faster than vanilla zlib for decompression:
https://github.com/ebiggers/libdeflate/blob/02dfa32da3ee3982c66278e714d2e21276dfb67b/lib/deflate_decompress.c#L32-L43

@KungFuJesus
Contributor

KungFuJesus commented May 10, 2023

> Why libdeflate is faster than vanilla zlib for decompression: https://github.com/ebiggers/libdeflate/blob/02dfa32da3ee3982c66278e714d2e21276dfb67b/lib/deflate_decompress.c#L32-L43

> Word accesses rather than byte accesses when copying matches

Pretty sure we do this, at least (a rough sketch of the idea follows after this list).

> Word accesses rather than byte accesses when reading input

I would hope we do this, but I'll have to look at the main loop to be sure. I'm not entirely convinced we couldn't do multiple words at a time and try to decode every possible op at once.

> Faster Huffman decoding combined with various DEFLATE-specific tricks

That merits a deeper dive to figure out what they're talking about.

> Larger bitbuffer variable that doesn't need to be refilled as often

I think we're doing a form of this now.

> On x86_64, a version of the decompression routine is compiled with BMI instructions enabled and is used automatically at runtime when supported.

100% doing this, now.
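
For the match-copy point, here's a rough sketch of the general idea (illustrative only, not zlib-ng's or libdeflate's actual inner loop): when the match distance is at least a word, the copy can proceed a word at a time, with some slack at the end of the output buffer to allow over-copy; only short overlapping distances need the byte-at-a-time fallback.

```c
/* Rough sketch of word-at-a-time match copying (illustrative only).
 * `out` points just past the bytes already produced; the match starts
 * `dist` bytes back and is `len` bytes long. */
#include <stdint.h>
#include <string.h>

static void copy_match(uint8_t *out, size_t dist, size_t len) {
    const uint8_t *src = out - dist;

    if (dist >= 8) {
        /* Source and destination never overlap within one 8-byte chunk,
         * so whole words can be copied per iteration.  This may write a
         * few bytes past `len`, which fast paths allow by keeping slack
         * at the end of the output buffer. */
        while (len > 0) {
            memcpy(out, src, 8);
            out += 8;
            src += 8;
            len = (len > 8) ? len - 8 : 0;
        }
    } else {
        /* Short distances (e.g. dist == 1, a repeated byte) overlap, so
         * copy byte by byte and let earlier output feed later output. */
        while (len-- > 0)
            *out++ = *src++;
    }
}
```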

@Dead2
Member

Dead2 commented May 10, 2023

I do envy the quality of docstrings that libdeflate has.

@nmoinvaz
Member

nmoinvaz commented May 10, 2023

Additionally, libdeflate uses an extra 32KB hash table for 3-byte matches. I don't think we want to consume that much more memory.
https://github.com/ebiggers/libdeflate/blob/02dfa32da3ee3982c66278e714d2e21276dfb67b/lib/hc_matchfinder.h#L69-L74
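
As a rough illustration of what that extra table buys (names and sizes here are hypothetical, not libdeflate's actual hc_matchfinder layout): a second, smaller hash keyed on 3-byte prefixes lets the matchfinder still find short nearby matches when the main 4-byte hash has no useful candidate.

```c
/* Hypothetical sketch of a secondary 3-byte hash table (illustrative only). */
#include <stdint.h>

#define HASH3_BITS  14
#define HASH3_SIZE  (1u << HASH3_BITS)    /* 16384 entries * 2 bytes = 32 KiB */

static uint16_t hash3_tab[HASH3_SIZE];    /* last position seen for each 3-byte hash */

static uint32_t hash3(const uint8_t *p) {
    uint32_t v = p[0] | ((uint32_t)p[1] << 8) | ((uint32_t)p[2] << 16);
    return (v * 0x9E3779B1u) >> (32 - HASH3_BITS);
}

/* Record the current position and return the previous one with the same
 * 3-byte hash; that previous position is a candidate for a length-3 match.
 * (Position 0 doubles as "empty" in this simplified sketch.) */
static uint16_t hash3_insert(const uint8_t *window, uint16_t pos) {
    uint32_t h = hash3(window + pos);
    uint16_t prev = hash3_tab[h];
    hash3_tab[h] = pos;
    return prev;
}
```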

@danielhrisca

Which one is isa-l in the results?

@powturbo
Author

igzip (isa-l's deflate implementation).

@powturbo
Author

powturbo commented Jul 20, 2023

I've extended the benchmark (TurboBench: Dynamic/Static web content compression benchmark) to include zstd and memory usage.
zlib-ng's memory allocation should be revised to allocate only the minimum necessary!

@Dead2
Member

Dead2 commented Aug 6, 2023

@powturbo Can you provide some detail on how you ran the benchmarks and how memory was measured?
AFAIK, this memory usage is not possible with the zlib-ng library alone, as its allocations are static and very small.
Were you using minigzip/minideflate or some other application for the benchmarks? Those might have, or cause, a memory leak that we are not aware of.

@powturbo
Author

powturbo commented Aug 7, 2023

This is done by TurboBench. The allocate/free functions are intercepted and the memory usage is monitored.
Build or download the Linux TurboBench binary from the releases and run
`./turbobench -ezlib_ng,6 file`.
The memory and stack usage is reported in the `file.tbb` result file.
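
For anyone wanting to reproduce that kind of measurement without TurboBench, here is a minimal sketch of the interception idea using zlib/zlib-ng's own allocator hooks. This is not TurboBench's actual code, and it only counts allocations made through a `z_stream`:

```c
/* Minimal sketch (not TurboBench's actual code): count the codec's heap usage
 * by hooking the zalloc/zfree members of z_stream, which zlib and zlib-ng
 * use for their internal allocations. */
#include <stdlib.h>
#include <zlib.h>

static size_t current_bytes, peak_bytes;

static voidpf counting_alloc(voidpf opaque, uInt items, uInt size) {
    (void)opaque;
    size_t n = (size_t)items * size;
    size_t *p = malloc(sizeof(size_t) + n);   /* prepend the size for the free hook */
    if (p == NULL)
        return Z_NULL;
    *p = n;
    current_bytes += n;
    if (current_bytes > peak_bytes)
        peak_bytes = current_bytes;
    /* size_t alignment is enough for zlib's internal structures in practice;
     * a production hook would be more careful. */
    return (voidpf)(p + 1);
}

static void counting_free(voidpf opaque, voidpf address) {
    (void)opaque;
    if (address == NULL)
        return;
    size_t *p = (size_t *)address - 1;
    current_bytes -= *p;
    free(p);
}

/* Usage: set strm.zalloc = counting_alloc and strm.zfree = counting_free
 * before deflateInit()/inflateInit(), run the codec, then read peak_bytes. */
```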

@ghuls

ghuls commented Sep 3, 2024

> AFAIK libdeflate requires the whole buffer up front and does not support streaming. So if all you look at is speed, then libdeflate will always win.

There is now a fork of libdeflate that adds streaming support (and multithreaded compression/decompression via its pgzip program): https://github.com/sisong/libdeflate/tree/stream-mt
ebiggers/libdeflate#335

# pigz with zlib-ng 2.21
$ timeit pigz -c -p 4 fragments.tsv | wc -c

Time output:
------------

  * Command: pigz -c -p 4 fragments.tsv
  * Elapsed wall time: 0:25.04 = 25.04 seconds
  * Elapsed CPU time:
     - User: 99.90
     - Sys: 1.23
  * CPU usage: 403%
  * Context switching:
     - Voluntarily (e.g.: waiting for I/O operation): 90798
     - Involuntarily (time slice expired): 218
  * Maximum resident set size (RSS: memory) (kiB): 7408
  * Number of times the process was swapped out of main memory: 0
  * Filesystem:
     - # of inputs: 0
     - # of outputs: 0
  * Exit status: 0

1132266009 -> compressed size

# libdeflate fork with streaming support (4 compression threads).
# RSS usage is small compared with normal libdeflate.
$ timeit ./libdeflate_streaming/pgzip/pgzip -c fragments.tsv | wc -c

Time output:
------------

  * Command: ./libdeflate_streaming/pgzip/pgzip -c fragments.tsv
  * Elapsed wall time: 0:18.02 = 18.02 seconds
  * Elapsed CPU time:
     - User: 70.86
     - Sys: 0.82
  * CPU usage: 397%
  * Context switching:
     - Voluntarily (e.g.: waiting for I/O operation): 30841
     - Involuntarily (time slice expired): 169
  * Maximum resident set size (RSS: memory) (kiB): 28092
  * Number of times the process was swapped out of main memory: 0
  * Filesystem:
     - # of inputs: 0
     - # of outputs: 0
  * Exit status: 0

1095567700 -> compressed size

# libdeflate fork with streaming support (1 compression thread).
$ timeit ./libdeflate_streaming/pgzip/pgzip -c -p 1 fragments.tsv | wc -c

Time output:
------------

  * Command: ./libdeflate_streaming/pgzip/pgzip -c -p 1 fragments.tsv
  * Elapsed wall time: 1:12.28 = 72.28 seconds
  * Elapsed CPU time:
     - User: 71.02
     - Sys: 0.82
  * CPU usage: 99%
  * Context switching:
     - Voluntarily (e.g.: waiting for I/O operation): 31250
     - Involuntarily (time slice expired): 359
  * Maximum resident set size (RSS: memory) (kiB): 6172
  * Number of times the process was swapped out of main memory: 0
  * Filesystem:
     - # of inputs: 0
     - # of outputs: 0
  * Exit status: 0

1095560363 -> compressed size

# original libdeflate
$ timeit ./libdeflate/gzip -c fragments.tsv | wc -c

Time output:
------------

  * Command: ./libdeflate/gzip -c fragments.tsv
  * Elapsed wall time: 1:15.49 = 75.49 seconds
  * Elapsed CPU time:
     - User: 73.95
     - Sys: 1.26
  * CPU usage: 99%
  * Context switching:
     - Voluntarily (e.g.: waiting for I/O operation): 34372
     - Involuntarily (time slice expired): 370
  * Maximum resident set size (RSS: memory) (kiB): 5612864
  * Number of times the process was swapped out of main memory: 0
  * Filesystem:
     - # of inputs: 0
     - # of outputs: 0
  * Exit status: 0

1126453467 -> compressed size

cldellow added a commit to cldellow/tilemaker that referenced this issue Oct 20, 2024
The zlib implementation that ships with most distributions is fairly slow,
even when you allow for the zlib algorithm itself being quite slow (vs
lz4 and zstd).

There are faster implementations. `libdeflate` [1] is one such example.
According to benchmarks [2], it's maybe 2-3x faster than zlib.

This PR updates helper.cpp's compression routines to use libdeflate.

It saves ~2-3% of total execution time for me:

Creating a GB mbtiles (zlib):

```
real 2m1.706s
user 28m24.186s
sys 0m41.886s
```

Creating a GB mbtiles (libdeflate):

```
real 1m58.450s
user 27m32.579s
sys 0m51.848s
```

Snapshotting an external dependency is sorta distasteful - I worry
about how much is too much from a maintenance POV. To mitigate that,
the snapshot is a direct copy of the upstream folders, so it should
be easy to update in the future if needed.

This also lets us drop the zlib1g-devel and boost-iostreams
dependencies, which is nice.

[1]: https://github.com/ebiggers/libdeflate
[2]: zlib-ng/zlib-ng#1486