Add optimization for Adler32 checksum for Power processors #458
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Hi,
This PR introduces a optimization for Adler32 checksum for POWER8+ processors that uses VSX (vector) instructions.
If adler32 do 1 byte at time on the first iteration s1 is s1_0 (_n means iteration n) is the initial value of adler, at beginning _0 is 1 unless adler initial value is different than 1. So s1_1 = s1_0 + c[0] after
the first calculation. For the next iteration s1_2 = s1_1 + c[1] and so on. Hence, for iteration N, s1_N = s1_(N-1) + c[N] is the value of s1 on after iteration N. Therefore, for s2, s2_N = s2_0 + Ns1_N + Nc[0] + N-1*c[1] + ... + c[N] In a more general way:
s1_N = s1_0 + sum(i=1 to N)*c[i]
s2_N = s2_0 + N*s1 + sum (i=1 to N)(N-i+1)*c[i]
Where s1_N, s2_N are the values for s1, s2 after N iterations. So if we can process N-byte at time we can obtain adler32 checksum for N-byte at once. Since VSX can support 16-byte vector instructions, we can process 16-byte at time using N = 16 we have:
s1 = s1_16 = s1_0 + sum(i=1 to 16)c[i]
s2 = s2_16 = s2_0 + 16*s1 + sum(i=1 to 16)(16-i+1)*c[i]
The VSX version starts to improve the performance for buffers with size >= 64. The performance is up to 10x better than Adler32 version from adler32 non-vectorized version (average cpu time in ns on 100000 iterations):
For buffer with length <= than 64 the performance is almost the same of
the non-vectorized implementation (with a small performance degradation in some cases):