Support SVE with assembly implementation #762
Conversation
I presume we'll wait for the currently ongoing issues to be fixed (CI tests, notably for SVE mode, broken dispatch) before reviewing this PR.
Rename xxh_x86dispatch.h to xxh_dispatch.h since it will be shared with other architectures. Signed-off-by: Haojian Zhuang <[email protected]>
Dispatch SVE, NEON and SCALAR implementations on arm64 by selecting different macros. Since the SVE implementation is not supported well by compilers, use assembly code instead.

In the intrinsic SVE implementation:
* Avoid accessing the ACC array in memory frequently in the accumulation routine.

In the assembly SVE implementation (dispatcher):
* Avoid accessing the ACC array in memory frequently in the accumulation routine.
* Use assembly code in the scramble routine.
* Since both the accumulation and scramble routines sit in the internal loop, convert the internal loop to an assembly version, and likewise avoid accessing the ACC array in memory frequently inside it.

Signed-off-by: Haojian Zhuang <[email protected]>
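To illustrate the "keep the accumulators out of memory" point, here is a minimal, hypothetical C sketch (not the PR's code; names and structure are illustrative) of an XXH3-style accumulation loop written with SVE intrinsics, assuming a vector length of at least 512 bits (e.g. A64FX) so that all 8 accumulator lanes fit in one register:

#include <arm_sve.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: keep the 8 accumulator lanes in a single Z register
 * across all stripes instead of reloading acc[] from memory each time.
 * Assumes an SVE vector length of at least 512 bits; endianness and
 * alignment details are omitted for brevity. */
static void accumulate_sve_sketch(uint64_t acc[8],
                                  const uint8_t *input,
                                  const uint8_t *secret,
                                  size_t nbStripes)
{
    const svbool_t pg = svwhilelt_b64_s32(0, 8);   /* first 8 lanes only */
    /* index vector {1,0,3,2,...}: swaps adjacent 64-bit lanes, like TBL in the asm */
    const svuint64_t swapIdx = sveor_n_u64_x(pg, svindex_u64(0, 1), 1);
    svuint64_t vacc = svld1_u64(pg, acc);          /* load accumulators once */
    size_t n;

    for (n = 0; n < nbStripes; n++) {
        svuint64_t data  = svld1_u64(pg, (const uint64_t *)(input  + 64 * n));
        svuint64_t key   = svld1_u64(pg, (const uint64_t *)(secret +  8 * n));
        svuint64_t mixed = sveor_u64_x(pg, data, key);
        svuint64_t swapped = svtbl_u64(data, swapIdx);
        svuint64_t lo = svand_n_u64_x(pg, mixed, 0xFFFFFFFF);
        svuint64_t hi = svlsr_n_u64_x(pg, mixed, 32);
        vacc = svadd_u64_x(pg, vacc, swapped);     /* acc += swap(data) */
        vacc = svmla_u64_x(pg, vacc, lo, hi);      /* acc += lo32 * hi32 */
    }
    svst1_u64(pg, acc, vacc);                      /* write back once at the end */
}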
Make bench tests support the assembly SVE routine.

With the SVE intrinsic implementation enabled, the build commands and performance data are below.

$export CPP_FLAGS="-DXXH_VECTOR=XXH_SVE"
$export CFLAGS="-O3 -march=armv8-a+sve -fPIC -DXXH_VECTOR=XXH_SVE"
$make

=== benchmarking 4 hash functions ===
benchmarking large inputs : from 512 bytes (log9) to 128 MB (log27)
xxh3   , 3679, 6019, 7807, 8945, 9862, 10343, 10622, 10604, 10782, 10697, 10763, 10900, 10913, 9959, 6374, 5979, 6057, 6076, 6108
XXH32  , 1326, 1440, 1495, 1523, 1534, 1541, 1545, 1534, 1505, 1506, 1507, 1506, 1508, 1456, 1248, 1195, 1199, 1201, 1200
XXH64  , 2510, 2803, 2978, 3072, 3121, 3139, 3155, 3127, 3051, 3046, 3059, 3060, 3059, 2899, 2117, 1983, 1991, 1993, 1991
XXH128 , 3421, 5791, 7501, 8891, 9787, 10363, 10646, 10435, 10809, 10935, 10974, 10999, 11002, 9916, 6099, 5773, 6110, 6109, 6119

With the SVE assembly implementation enabled, the build commands and performance data are below.

$export CPP_FLAGS="-DXXH_VECTOR=XXH_SVE"
$export CFLAGS="-O3 -march=armv8-a+sve -fPIC -DXXH_VECTOR=XXH_SVE"
$make DISPATCH=1

=== benchmarking 4 hash functions ===
benchmarking large inputs : from 512 bytes (log9) to 128 MB (log27)
xxh3   , 4142, 6663, 9745, 12327, 13990, 15064, 15631, 15515, 15412, 14055, 14105, 14135, 14126, 11953, 4585, 4000, 4013, 4042, 4033
XXH32  , 1326, 1440, 1495, 1523, 1535, 1543, 1547, 1536, 1500, 1503, 1503, 1502, 1485, 1452, 1243, 1192, 1199, 1199, 1197
XXH64  , 2499, 2760, 2975, 3071, 3122, 3137, 3153, 3133, 3041, 3044, 3015, 3051, 3030, 2897, 2124, 1977, 1988, 1989, 1967
XXH128 , 3903, 6454, 9485, 11954, 13807, 15135, 15891, 15381, 15376, 15442, 15678, 15677, 15728, 13096, 4698, 4132, 4046, 4044, 4051

Signed-off-by: Haojian Zhuang <[email protected]>
If an assembler source contains no GNU-stack note, the system by default assumes that an executable stack may be required. (GCC generates code to be executed on the stack when it implements a trampoline for nested functions.) This default behavior introduces a security issue. Signed-off-by: Haojian Zhuang <[email protected]>
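For reference, the conventional fix is to add an empty .note.GNU-stack section to the assembly source, which tells the linker that no executable stack is needed. This is an illustration, not necessarily the exact line used in this PR:

// Place at the end of the .S file: an empty GNU-stack note so the
// linker does not mark the program's stack as executable.
.section .note.GNU-stack, "", %progbits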
Rebase the patch set since the CI issue has been fixed.
This is a more complex PR; it will take some time to get through it. To begin with, we may have another opportunity to divide and conquer here, which would make each part easier to review. It seems there are 2 combined efforts that could be isolated:
Speaking of the assembly version: I was lucky enough to (briefly) get access to a server with SVE support, an AWS c7g instance. Nonetheless, this was a good opportunity to compare the intrinsics and assembly variants, and the difference was quite small, approximately +5% in favor of assembly. Now, this could be due to several factors. This matters because assembly introduces a substantial build difficulty, so it should be matched by some corresponding benefit. If +5% is about the right expectation, then mainline is probably not the best target for it (though it is still a good reason to create a specialized variant).
From a small amount of digging, c7g (AWS Graviton 3) seems to be based on the Neoverse V1, which is SVE-256. Looking at the optimization guide:
However, I question how much of the C performance is just compilers not being quite ready yet, and whether in a few months the entire code will be obsolete (especially since SVE-compatible machines are more readily available now).
So your guess is that the performance is mostly a matter of instruction ordering and pipelining?
Yes. If Graviton3 is an ARM design and not designed in-house, it is likely just as sensitive to instruction ordering and pipelining as the Cortexes I've been fiddling with.
As for the interleaved SVE, try replacing the sve256 loop block (L294 to L306) with this. Disclaimer: I haven't tested this and my ordering might not be ideal.

10:
// since vector length is known, avoid predicates for more freedom
ldr z19, [x1]
ldr z20, [x1, #1, mul vl]
ldr z21, [x2]
ldr z22, [x2, #1, mul vl]
prfd pldl1strm, p7, [x1, #31, mul vl]
eor z21.d, z21.d, z19.d
eor z22.d, z22.d, z20.d
tbl z23.d, {z19.d}, z7.d
tbl z24.d, {z20.d}, z7.d
lsr z25.d, z21.d, #32
and z21.d, z21.d, #0xFFFFFFFF
// Pretty sure this will forward to the add
mad z23.d, p7/m, z25.d, z21.d
// Encourage V1 usage
lsr z26.d, z22.d, #32
add z17.d, z17.d, z23.d
and z22.d, z22.d, #0xFFFFFFFF
mad z24.d, p7/m, z26.d, z22.d
add z18.d, z18.d, z24.d
add x1, x1, #64
add x2, x2, #8
add x10, x10, #1
cmp x10, x3
b.lt 10b
Unfortunately, my access to this platform was very temporary.
As a side note, I wonder if it is beneficial to just use NEON on SVE-128.
No biggie.
Let me check whether I can get an AWS c7g instance.
I mainly worked on the Fujitsu A64FX platform. Performance improved a lot with the assembly code. As I recall, A64FX is based on ARMv8.2, and AWS c7g is maybe based on ARMv8.4.
Yes, I mainly verified it on the Fujitsu A64FX.
By the way, I did some research on the A64FX, and from what it appears, NEON has been severely performance-deprecated (as in everything but the trivial instructions having 6-12 cycles of latency) to put more power into SVE. This is an understandable design choice because the chip is designed for specialized uses and not general purpose. So NEON being bad on this particular platform is unavoidable.
SVE doesn't work as expected; I'll keep investigating. My understanding is that SVE should work a bit better than NEON. @easyaspi314 There's some issue in your code snippet: "LSR <Zdn>.<T>, <Pg>/M, <Zdn>.<T>, <Zm>.<T>". If I switch from UXTW to an AND instruction, I need one more instruction, so I gain no benefit.
I think that for now we should only do SVE-512. Looking at the optimization guide, c7g is a tradeoff: while SVE can process 2x the data, NEON always has at least 2x the IPC (and NEON already has 256-bit loads). Also, scalar code can still be executed in the background, and we lose that benefit unless we mix in NEON, which would just be messy and not worth the complexity. However, with 100% optimal code, SVE will lose because the multiply is significantly slower and can't be parallelized:
(Parentheses are for when the result can't be forwarded to another add instruction, which isn't the case in XXH3.) The ARM pipeline is so fun 💀 SVE2 may have better performance because it can use a widening multiply-accumulate (UMLALB).
I assume that SVE is more sensitive to cache.
Will take a look, thanks.
There are some issues with 32-bit ARM that I want to work out, which would cause issues on the latter. A lot of it seems to be load-store issues. If I write XXH3_accumulate_512_neon in 32-bit ARM inline asm, I can get a significant performance boost, up to 12 GB/s vs 7 GB/s. However, this comes at a tradeoff in practicality: Google and Apple require all apps to have 64-bit support, so this will only really benefit Android devices that are at least 6 years old and armv7l Linux. I want the main target to be AArch64 and only go for low-footprint optimizations for 32-bit ARM.
Hopefully. It unfortunately seems like ARM has gone back to 128-bit SVE for its newest designs, including the V2. This might be because, as we see on c7g, the throughput penalty makes it not that much better than 2 NEON instructions.
c7g does not have SVE2. However, if you have access to an SVE2 machine with a known 128-bit vector size, try replacing XXH3_accumulate_512_neon with this. Also make sure to include the fixed-length typedefs:

typedef svuint64_t xxh_u64x2 __attribute__((arm_sve_vector_bits(128)));
typedef svuint32_t xxh_u32x4 __attribute__((arm_sve_vector_bits(128)));

XXH_FORCE_INLINE void XXH3_accumulate_512_neon(
    void *XXH_RESTRICT acc,
    const void *XXH_RESTRICT input,
    const void *XXH_RESTRICT secret
)
{
    size_t i;
    uint64x2_t *xacc = (uint64x2_t *) acc;
    const uint8_t *xinput = (const uint8_t *)input;
    const uint8_t *xsecret = (const uint8_t *)secret;

    XXH_ASSERT(svcntd() == 2);
    /* the last two lanes are handled with scalar code */
    for (i = 6; i < 8; i++) {
        XXH3_scalarRound(acc, input, secret, i);
    }
    for (i = 0; i < 3; i++) {
        uint64x2_t data_vec = XXH_vld1q_u64(xinput + 16 * i);
        uint64x2_t key_vec  = XXH_vld1q_u64(xsecret + 16 * i);
        uint64x2_t swapped  = vextq_u64(data_vec, data_vec, 1);
        uint64x2_t mixed_lo = veorq_u64(data_vec, key_vec);
        /* (x << 32) | (x >> 32) */
        uint32x4_t mixed_hi = vrev64q_u32(vreinterpretq_u32_u64(mixed_lo));
        uint64x2_t mul = (uint64x2_t)(xxh_u64x2)
            svmlalb_u64(
                (xxh_u64x2)swapped,
                (xxh_u32x4)mixed_lo,
                (xxh_u32x4)mixed_hi
            );
        xacc[i] = vaddq_u64(mul, xacc[i]);
    }
}

When compiling, use these flags (also try both GCC and Clang):
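An assumed, illustrative set of flags for a recent GCC or Clang (not taken from the original comment): the arm_sve_vector_bits(128) attribute used above only compiles when the SVE vector length is pinned at compile time, and the SVE2 intrinsics need an SVE2-capable -march.

$export CFLAGS="-O3 -march=armv9-a+sve2 -msve-vector-bits=128"
$make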
@easyaspi314 Although both NEON and SVE2 support a multiplication from two 32-bit inputs to a 64-bit result, they're totally different.
Yes, and the reason it is favorable is that, instead of requiring the low and high 32-bit halves to be extracted into separate vectors first, it reads the even-numbered 32-bit elements in place.
Sorry, I didn't get the point. After
After
But we hope
Ah, you are confused because the uzp trick is for two vectors at once; this is for only one. Come to think of it, this would actually have literally zero benefit over the two-vector approach, aside from a minor data dependency, for things that are going to be executed 4 at a time with identical timings (on the performance cores). I take back what I said: SVE-128 is garbage, just use NEON 🤪
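To make the difference discussed above concrete, here is a small illustrative comparison (not code from the PR; it requires a compiler targeting SVE2). NEON's widening multiply-accumulate pairs corresponding 32-bit elements of two narrow vectors, while SVE2's "bottom" form pairs the even-numbered 32-bit elements of full-width vectors:

#include <arm_neon.h>
#include <arm_sve.h>

/* Illustrative only: the two 32x32 -> 64 multiply-accumulate forms discussed. */
uint64x2_t neon_widening_mla(uint64x2_t acc, uint32x2_t a, uint32x2_t b)
{
    return vmlal_u32(acc, a, b);    /* acc[i] += (uint64_t)a[i] * b[i]     */
}

svuint64_t sve2_widening_mla(svuint64_t acc, svuint32_t a, svuint32_t b)
{
    return svmlalb_u64(acc, a, b);  /* acc[i] += (uint64_t)a[2*i] * b[2*i] */
}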
I tried to mix NEON and SVE2. At first, the compiler reported an error when I converted a NEON data type to an SVE type. Then I switched to assembly and met some strange errors while declaring
I plan to simplify the patch set. The dispatcher checks the CPU: if it supports SVE-512, it uses the assembly routine; if not, it falls back to the NEON routine. Would that be acceptable?
Since it's assembly, I only mix NEON and SVE2 for now; I'll mix scalar, NEON, and SVE2 later. With the help of SVE2, it saves one instruction and gains some performance (RED vs GREEN).
That difference might solely be from it being handwritten assembly. However, even if it isn't, I'd say that even interleaved with scalar it clearly isn't going to do enough to be worth an entirely different target from NEON. I would say that the minimum to warrant something like that would be 20%, and it should be on a newer target that will be worth even more in the future, not an older one (hence why I wouldn't want an entire ARMv7-A NEON inline assembly implementation, even though it benefits performance).
OK. How about SVE-512 assembly code? Could that be accepted?
I'd say yes, although I would recommend the following priority:
Also, dispatching on POSIX ELF would be trivial:

#include <sys/auxv.h>
#include <arm_sve.h>   /* for svcntd() */

__attribute__((target("+sve")))
static int XXH_isSVE512(void)
{
    /* svcntd() returns the number of 64-bit lanes: 8 lanes == 512-bit vectors */
    return svcntd() >= 8;
}

__attribute__((constructor))
static void XXH_featureTest(void)
{
    /* check HWCAP_SVE before executing any SVE instruction */
    if ((getauxval(AT_HWCAP) & HWCAP_SVE) != 0 && XXH_isSVE512()) {
        // SVE 512
    } else {
        // neon
    }
}
Agreed.
We haven't looked at this PR in a while. I believe it's fair to say that our current code base is not ready to add an assembly file at this point. Note that, in the future, we'll try to change the code base to support multiple files, which might open a door for assembly source code again, but that's for later. As for now, I note that this PR also introduced other interesting tools and capabilities.
I was wondering which parts of this PR could be salvaged. Another interesting technology presented in this PR is the runtime dispatcher for arm64. However, there is a non-negligible difference: the hardware detection is not limited to "just" detecting a CPU feature flag. Anyway, this will require a bit more work in order to be merged. At this point, I mostly wonder if this PR should remain open, for reference, with the idea that a future effort could restart from it.
Closing due to lack of activity. While the topic of providing an SVE implementation remains of interest, the current code base is not ready for it at this point.
With an assembly implementation, performance on SVE could continue to be improved.