Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NEON implementation for Adler32 #251

Closed
wants to merge 2 commits into from
Closed

Conversation

Adenilson
Copy link

The checksum is calculated in the uncompressed PNG data and can be made much faster by using SIMD.

Tests in ARMv8 yielded an improvement of about 3x (e.g. walltime was 350ms x 125ms for 4096x4096 bytes executed 30 times). That results in at least 18% improvement in PNG image decoding in Chromium.

Further details at:
https://bugs.chromium.org/p/chromium/issues/detail?id=688601

@Adenilson Adenilson force-pushed the NEON03 branch 2 times, most recently from bd6476f to aa4d660 Compare April 12, 2017 19:13
@ProgramMax
Copy link

Hello everyone. [email protected] / @chromium.org here.

Don't forget to add yourself to contrib/README.contrib

The checksum is calculated in the uncompressed PNG data
and can be made much faster by using SIMD.

Tests in ARMv8 yielded an improvement of about 3x
(e.g. walltime was 350ms x 125ms for a 4096x4096 bytes
executed 30 times). That results in at least 18% improvement
in image decoding in Chromium.

Further details at:
https://bugs.chromium.org/p/chromium/issues/detail?id=688601
@Adenilson
Copy link
Author

@ProgramMax nice catch! Fixed.

CRC32 affects performance for both image decompression (PNG) as also in
general browsing while accessing websites that serve content using
compression (i.e. Content-Encoding: gzip).

This first patch implements an optimized CRC32 function using the
dedicated instruction available in ARMv8.

It should be between 6x (A53: 116ms X 22ms for a 4Kx4Kx4 buffer) to
10x faster (A72: 91ms x 9ms) than the C implementation currently used by zlib.

Details:
https://bugs.chromium.org/p/chromium/issues/detail?id=709716

Change-Id: I069408ebc06c49a3c2be4ba3253319e025ee09d7
@Adenilson
Copy link
Author

Just uploaded a function that uses the new ARMv8 crc32 instruction (it is about 10x faster than the C function in zlib).

To build using the optimized versions, just have an ARM compiler (e.g. from linaro, https://releases.linaro.org/components/toolchain/binaries/6.3-2017.02/) and export CC to point to it (e.g. export CC=arm-linux-gnueabihf-gcc-5) and next enable the feature in CMake buildsystem, either using the ncurses GUI (i.e. ccmake ..) or passing the options as:
cmake .. -DARMv8=ON -DARMv8CRC=ON

I didn't touch the configure script and tried my best to integrate the ARM specific functions using the same strategy used for AMD64 and ASM686 specific code.

@Adenilson
Copy link
Author

Also validated the changes with 'make test':
(xenial)adenilson@localhost:~/canonical-fork/build$ make test
Running tests...
Test project /home/adenilson/canonical-fork/build
Start 1: example
1/2 Test #1: example .......................... Passed 0.01 sec
Start 2: example64
2/2 Test #2: example64 ........................ Passed 0.01 sec

100% tests passed, 0 tests failed out of 2

Total Test time (real) = 0.01 sec

@Adenilson
Copy link
Author

Touching the base on this one as it has passed over 1 month already.

@Adenilson
Copy link
Author

Friendly ping on the matter.

@Adenilson
Copy link
Author

@madler anything you like to change on the original patch?

@Adenilson
Copy link
Author

@madler ping?

@diizzyy
Copy link

diizzyy commented Sep 4, 2017

@Adenilson
Is this ARMv8 only or does it also work on ARMv7?

@Adenilson
Copy link
Author

First patch should work in both ARMv7 (assuming that the SoC has a NEON unit) + ARMv8.

Second (CRC32) is ARMv8 (both AArch32 and AArch64) only.

@Adenilson
Copy link
Author

@madler friendly ping?

@Adenilson
Copy link
Author

For record, Chrome M62 is shipping the inflate_fast optimization and Chrome M63 has a variant of the Adler-32 optimization.

I'm working towards having the patch using the ARMv8 crc32 instruction included in M64 (branching in 2 weeks).

uint32_t armv8_crc32_little(uint32_t crc,
const unsigned char *buf,
size_t len) {
uint32_t c;
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a second thought, this should handle the case of buf == Z_NULL.

@Adenilson
Copy link
Author

Adenilson commented Dec 13, 2017

An update on the issue: the crc32 optimization landed on Chrome M64 but it was reverted because it broke the build for android_x86 and android_x64.

I'm working in a new version for the crc32 optimization, using __crc32d() as it seems to be up to 2x faster than the original ARMv8 crc32 code (which was itself 10x faster than the vanilla C code).

I've posted some data in:
https://bugs.chromium.org/p/chromium/issues/detail?id=709716#c22

It can be from 32x to 45x faster (big core X little core) than the original zlib C function for vectors of 8KB.

As soon we land this on Chromium I'm planning to update this merge request.

Some samples:

random data length 16384 bytes
--test-- crc32 Min time Max Rate Median time Median Rate
crc32_zlib d2a888f6 65 u-secs 0.24 MB/s 65 u-secs 0.24 MB/s
crc32_neon d2a888f6 1 u-secs 15.62 MB/s 1 u-secs 15.62 MB/s

random data length 8192 bytes
--test-- crc32 Min time Max Rate Median time Median Rate
crc32_zlib 09bc142e 32 u-secs 0.24 MB/s 33 u-secs 0.24 MB/s
crc32_neon 09bc142e 0 u-secs inf MB/s 1 u-secs 7.81 MB/s

jow- pushed a commit to lede-project/source that referenced this pull request Jan 2, 2018
This adds two optimizations for ARM:
NEON optimized Adler(-)32 checksum algorithm (ARMv7 and newer NEON CPUs)
ARM(v7+) specific optimization for inflate
I've also connected inflate optimization to the build using the following
source as template.
mirror/chromium@0397489#diff-a62ad2db6c83dbc205d34bb9a8884f16

Additional info:
https://codereview.chromium.org/2676493007/
https://codereview.chromium.org/2722063002/

Sources:
madler/zlib#251 (only the first commit)
madler/zlib#256

Signed-off-by: Daniel Engberg <[email protected]>
SpiralP pushed a commit to SpiralP/lede-source that referenced this pull request Jan 2, 2018
This adds two optimizations for ARM:
NEON optimized Adler(-)32 checksum algorithm (ARMv7 and newer NEON CPUs)
ARM(v7+) specific optimization for inflate
I've also connected inflate optimization to the build using the following
source as template.
mirror/chromium@0397489#diff-a62ad2db6c83dbc205d34bb9a8884f16

Additional info:
https://codereview.chromium.org/2676493007/
https://codereview.chromium.org/2722063002/

Sources:
madler/zlib#251 (only the first commit)
madler/zlib#256

Signed-off-by: Daniel Engberg <[email protected]>
jollaman999 pushed a commit to jollaman999/openwrt that referenced this pull request Jan 13, 2018
This adds two optimizations for ARM:
NEON optimized Adler(-)32 checksum algorithm (ARMv7 and newer NEON CPUs)
ARM(v7+) specific optimization for inflate
I've also connected inflate optimization to the build using the following
source as template.
mirror/chromium@0397489#diff-a62ad2db6c83dbc205d34bb9a8884f16

Additional info:
https://codereview.chromium.org/2676493007/
https://codereview.chromium.org/2722063002/

Sources:
madler/zlib#251 (only the first commit)
madler/zlib#256

Signed-off-by: Daniel Engberg <[email protected]>
@Adenilson
Copy link
Author

Friendly ping on the subject.

Any hope to merge this upstream?

@tbeu
Copy link
Contributor

tbeu commented Jan 24, 2018

Any hope to merge this upstream?

Hope never dies.

@Adenilson
Copy link
Author

We did further testing, the average gain of the ARMv8 crc32 in PNG decoding ranges from 2.1% (140PNGs corpus) to 3.2% (Doodles) and up to 7.1% (Kodak).

The latest version of the crc32 patch (https://chromium-review.googlesource.com/c/chromium/src/+/801108) has code to perform the CPU feature detection on ARM.

@Adenilson
Copy link
Author

@Adenilson
Copy link
Author

The 140PNGs corpus is internal to Chromium developers (and access is granted by request).

@Adenilson
Copy link
Author

@madler friendly ping on the subject.

The crc32 optimized function improved the performance decompressing gzipped content in 29%.

@Adenilson
Copy link
Author

Hey, look at this... we are just about to complete 1 year (since the pull request)!
:-)

@Adenilson
Copy link
Author

The ARMv8-a optimized crc32 code (an improved version of it) landed in Chromium almost 1 month ago and made the cut to Chromium M66.

If all goes well, M66 will be shipping to users in the first week of May with this optimization enabled for ARM.

@Adenilson
Copy link
Author

Adenilson commented Apr 11, 2018

Look at this, time really flies: 1 (one) year since the request was open!

To celebrate the anniversary, I opened another merge request (the inflate_fast NEON optimization, should be around 20% faster at decompression in average) in:
#345

And if one day we manage to merge the Chromium optimizations (ARM + Intel):
#346

@Adenilson
Copy link
Author

Adenilson commented Apr 25, 2018

Friendly ping... @madler any comment?

@Adenilson
Copy link
Author

@madler any comment?

@Adenilson
Copy link
Author

@madler ping?

antonlacon added a commit to antonlacon/LibreELEC.tv that referenced this pull request Aug 20, 2018
This introduces arm/neon optimizations to zlib.

The first two patches are a neon optimization relating to zlib's
inflate function. They increase decompression speed. It has been
shipping in Chromimum since release 62 (Oct. 2017). The patches have
been pulled from a PR to zlib upstream: madler/zlib#345.

Patches 003 and 004 have been pulled from Fedora Core's aarch64 zlib
package. They improve zlib compression speed and have been there for
4 months.

Patch 005 is pulled from a PR to zlib upstream. madler/zlib#251. It's
been shipping in Chromium since release 63, and increases
decompression speed.

Patch 006 is my own to allow 005 to merge without conflict with the
previous patches.

Signed-off-by: Ian Leonard <[email protected]>
@Adenilson
Copy link
Author

It is being over 1 1/2 year (i.e. 18 months), I'm closing this merge request for the given reasons:

a) We are shipping faster (and better tested) versions of this optimizations on Chromium.
b) These optimizations have also being ported to x86.

For anyone interested, please check the latest code in:
a) NEON-ized Adler-32: https://cs.chromium.org/chromium/src/third_party/zlib/adler32_simd.c?l=203
b) ARMv8 crypto accelerated CRC32: https://cs.chromium.org/chromium/src/third_party/zlib/crc32_simd.c?l=159

@Adenilson Adenilson closed this Oct 10, 2018
@tbeu
Copy link
Contributor

tbeu commented Oct 11, 2018

Too sad.

@praiskup
Copy link

Mark, I'm curious what would help with reviews of similar contributions like this one:

  • is this something which should never go to zlib?
  • is this lack of time only? Could you e.g. specify how to become a contributor to zlib, so you could enlarge the set of people you trust enough to do the code reviews?
  • would some automatic testing help here? I mean, if we had some HW being able to test the added code, would you be more confident that the code from pull request is acceptable?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants