Add optimization for Adler32 checksum for Power processors #458

racardoso · 2019-12-10T19:33:10Z

Hi,

This PR introduces an optimization of the Adler32 checksum for POWER8+ processors that uses VSX (vector) instructions.

When adler32 processes 1 byte at a time, s1_0 is the initial value of s1 (_n denotes the value after iteration n), taken from the initial adler value; it is 1 unless a different initial adler value is given. After the first iteration, s1_1 = s1_0 + c[1]; after the second, s1_2 = s1_1 + c[2], and so on. Hence, s1_N = s1_(N-1) + c[N] is the value of s1 after iteration N. Since each iteration also adds the updated s1 to s2 (s2_n = s2_(n-1) + s1_n), s2_N = s2_0 + N*s1_0 + N*c[1] + (N-1)*c[2] + ... + c[N]. In a more general way:

s1_N = s1_0 + sum(i=1 to N)c[i]

s2_N = s2_0 + N*s1_0 + sum(i=1 to N)(N-i+1)*c[i]

Where s1_N, s2_N are the values of s1, s2 after N iterations. So if we can process N bytes at a time, we can obtain the adler32 checksum for N bytes at once. Since VSX supports 16-byte vector instructions, we can process 16 bytes at a time. Using N = 16, we have:

s1 = s1_16 = s1_0 + sum(i=1 to 16)c[i]

s2 = s2_16 = s2_0 + 16*s1_0 + sum(i=1 to 16)(16-i+1)*c[i]
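To make the N = 16 case concrete, here is a minimal scalar sketch in C (not the VSX code from this PR) that applies the two formulas above once per 16-byte block and falls back to the byte-at-a-time loop for the tail. The function names `adler32_bytewise` and `adler32_blockwise` are hypothetical, and reducing modulo 65521 after every block is a simplification chosen for clarity, not necessarily what the vectorized code does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ADLER_BASE 65521u /* largest prime smaller than 65536 */
#define BLOCK 16u         /* bytes per step, the width of one VSX register */

/* Reference version: one byte per iteration. */
uint32_t adler32_bytewise(uint32_t adler, const unsigned char *buf, size_t len)
{
    uint32_t s1 = adler & 0xffffu;
    uint32_t s2 = (adler >> 16) & 0xffffu;

    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        s1 = (s1 + buf[i]) % ADLER_BASE;
        s2 = (s2 + s1) % ADLER_BASE;
    }
    return (s2 << 16) | s1;
}

/* Block version (hypothetical name): applies the N = 16 closed form. */
uint32_t adler32_blockwise(uint32_t adler, const unsigned char *buf, size_t len)
{
    uint32_t s1 = adler & 0xffffu;
    uint32_t s2 = (adler >> 16) & 0xffffu;

    assert(buf != NULL || len == 0);
    while (len >= BLOCK) {
        uint32_t sum = 0;      /* sum(i=1 to 16) c[i] */
        uint32_t weighted = 0; /* sum(i=1 to 16) (16-i+1)*c[i] */

        /* j is 0-based, so the weight 16-i+1 becomes 16-j. */
        for (uint32_t j = 0; j < BLOCK; j++) {
            sum += buf[j];
            weighted += (BLOCK - j) * buf[j];
        }
        /* Update s2 first: the formula uses s1_0, the value before the block. */
        s2 = (s2 + BLOCK * s1 + weighted) % ADLER_BASE;
        s1 = (s1 + sum) % ADLER_BASE;
        buf += BLOCK;
        len -= BLOCK;
    }
    /* Fewer than 16 bytes left: finish one byte at a time. */
    return adler32_bytewise((s2 << 16) | s1, buf, len);
}
```

Both functions return 0x11E60398 for the 9-byte ASCII string "Wikipedia" with an initial value of 1, and they agree on longer inputs that exercise the 16-byte path. In a vector implementation, the inner loop is where the two sums would be computed with vector instructions.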

The VSX version starts to improve performance for buffers of size >= 64 bytes. It is up to 10x faster than the non-vectorized adler32 version (average CPU time in ns over 100000 iterations):

| buffer size (bytes) | adler32 baseline (ns) | adler32 power (ns) | speedup |
|---|---|---|---|
| 64 | 44.921875 | 41.015625 | - |
| 1024 | 943.359375 | 130.859375 | 7.2 |
| 10*5552 | 42519.531250 | 3974.609375 | 10.7 |

For buffers shorter than 64 bytes, the performance is almost the same as that of the non-vectorized implementation (with a small performance degradation in some cases):

| buffer size (bytes) | adler32 baseline (ns) | adler32 power (ns) |
|---|---|---|
| NULL | 5.859375 | 6.812500 |
| 1 | 3.906250 | 4.859375 |
| 15 | 11.718750 | 12.625000 |
| 48 | 35.156250 | 33.203125 |
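For anyone who wants to reproduce numbers of this kind, below is a rough sketch of how an average CPU time per call could be measured in C with `clock()`. This is an assumption about the general approach, not the harness actually used for the tables above; `average_cpu_ns` and `adler32_simple` are hypothetical names, and `adler32_simple` is only a stand-in for whichever implementation is being timed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

typedef uint32_t (*checksum_fn)(uint32_t, const unsigned char *, size_t);

/* Stand-in checksum (hypothetical name): plain byte-at-a-time Adler-32. */
uint32_t adler32_simple(uint32_t adler, const unsigned char *buf, size_t len)
{
    uint32_t s1 = adler & 0xffffu;
    uint32_t s2 = (adler >> 16) & 0xffffu;

    for (size_t i = 0; i < len; i++) {
        s1 = (s1 + buf[i]) % 65521u;
        s2 = (s2 + s1) % 65521u;
    }
    return (s2 << 16) | s1;
}

/* Average CPU time of one call to fn, in nanoseconds. */
double average_cpu_ns(checksum_fn fn, const unsigned char *buf, size_t len,
                      unsigned iterations)
{
    /* volatile keeps the compiler from discarding the calls. */
    volatile uint32_t sink = 0;

    assert(fn != NULL && iterations > 0);
    clock_t start = clock();
    for (unsigned i = 0; i < iterations; i++)
        sink ^= fn(1, buf, len);
    clock_t end = clock();
    (void)sink;

    return (double)(end - start) / CLOCKS_PER_SEC * 1e9 / iterations;
}
```

Calling `average_cpu_ns(adler32_simple, buf, 1024, 100000)` would give a figure comparable in kind to one table cell. The resolution of `clock()` is coarse, so short buffers need many iterations for a stable average.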

mscastanho · 2019-12-10T20:33:49Z

FYI this PR uses the same base commit as #457 to add base code for Power optimizations. When either one gets accepted, the other can be rebased to remove the first commit from the PR.

Optimized functions for Power will make use of GNU indirect functions, an extension to support different implementations of the same function, which can be selected during runtime. This will be used to provide optimized functions for different processor versions. Since this is a GNU extension, we placed the definition of the Z_IFUNC macro under `contrib/gcc`. This can be reused by other archs as well. Author: Matheus Castanho <[email protected]> Author: Rogerio Alves <[email protected]>
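As a rough illustration of the mechanism described above, here is a minimal sketch of runtime selection with a GNU indirect function. It does not reproduce the Z_IFUNC macro from this PR; the names `adler32_dispatch`, `adler32_resolver`, `adler32_baseline` and `adler32_power8_stub` are hypothetical, and the POWER8 function is only a placeholder that forwards to the baseline. The sketch assumes GCC or Clang on an ELF system with glibc and falls back to an ordinary wrapper elsewhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*adler32_fn)(uint32_t, const unsigned char *, size_t);

/* Portable byte-at-a-time implementation. */
static uint32_t adler32_baseline(uint32_t adler, const unsigned char *buf, size_t len)
{
    uint32_t s1 = adler & 0xffffu;
    uint32_t s2 = (adler >> 16) & 0xffffu;

    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        s1 = (s1 + buf[i]) % 65521u;
        s2 = (s2 + s1) % 65521u;
    }
    return (s2 << 16) | s1;
}

#if defined(__powerpc64__) && defined(__GNUC__)
/* Placeholder for a VSX version; it only forwards to the baseline here. */
static uint32_t adler32_power8_stub(uint32_t adler, const unsigned char *buf, size_t len)
{
    return adler32_baseline(adler, buf, len);
}
#endif

/* Resolver: returns the implementation to use on this machine. */
static adler32_fn adler32_resolver(void)
{
#if defined(__powerpc64__) && defined(__GNUC__)
    /* "arch_2_07" is the ISA level implemented by POWER8. */
    if (__builtin_cpu_supports("arch_2_07"))
        return adler32_power8_stub;
#endif
    return adler32_baseline;
}

#if defined(__GNUC__) && defined(__ELF__) && defined(__GLIBC__)
/* GNU indirect function: the dynamic loader calls the resolver once. */
uint32_t adler32_dispatch(uint32_t adler, const unsigned char *buf, size_t len)
    __attribute__((ifunc("adler32_resolver")));
#else
/* Fallback without ifunc support: resolve on every call. */
uint32_t adler32_dispatch(uint32_t adler, const unsigned char *buf, size_t len)
{
    return adler32_resolver()(adler, buf, len);
}
#endif
```

Callers use `adler32_dispatch` like any other function. With ifunc, the choice is made once when the symbol is bound, so there is no per-call branch on CPU features.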

This commit implements a Power (POWER8+) vector optimization for Adler32 checksum using VSX (vector) instructions. The VSX adler32 checksum is up to 10x faster than the adler32 baseline code. Author: Rogerio Alves <[email protected]>

This commit adds tests for adler32 vector optimization for Power (POWER8+). Author: Rogerio Alves <[email protected]>

mscastanho mentioned this pull request Dec 12, 2019

Add optimized longest_match for Power processors #459

Open

nmoinvaz mentioned this pull request Jan 17, 2020

Add AltiVec-optimized adler32 and slide_hash for PowerPC zlib-ng/zlib-ng#109

Merged

mscastanho mentioned this pull request Feb 3, 2020

Adding CPU features detection code #468

Open

Rogerio Alves added 3 commits March 11, 2020 10:55

Adler32 vector optimization for Power.

bdb8025

This commit implements a Power (POWER8+) vector optimization for Adler32 checksum using VSX (vector) instructions. The VSX adler32 checksum is up to 10x faster than the adler32 baseline code. Author: Rogerio Alves <[email protected]>

Tests for Adler32 vector optimization for Power.

5e27408

This commit adds tests for adler32 vector optimization for Power (POWER8+). Author: Rogerio Alves <[email protected]>

racardoso force-pushed the vec-adler32-power branch from 6cb701f to 5e27408 Compare March 11, 2020 13:58

adler32_test: Fix warning when compiling with -Wall

6d578f4

mscastanho mentioned this pull request Jun 18, 2020

Add optimized adler32 for POWER and new adler32 tests zlib-ng/zlib-ng#647

Merged

Fix invalid memory access on ppc and ppc64

40c4f6d

Neustradamus mentioned this pull request Aug 23, 2023

IBM Power Processors and Zlib #847

Open

Neustradamus mentioned this pull request Jan 1, 2025

CMake and Zlib #831

Open
