Support SVE with assembly implementation #762
Conversation
I presume we'll wait for the currently ongoing issues to be fixed (CI tests, notably for SVE mode, broken dispatch) before reviewing this PR.
Rename xxh_x86dispatch.h to xxh_dispatch.h since it will be shared with other architectures. Signed-off-by: Haojian Zhuang <[email protected]>
Dispatch SVE, NEON and SCALAR implementations on arm64 by selecting different macros. Since the SVE implementation is not supported well by compilers, use assembly code instead.

In the intrinsic SVE implementation:
* Avoid accessing the ACC array in memory frequently in the accumulation routine.

In the assembly SVE implementation (dispatcher):
* Avoid accessing the ACC array in memory frequently in the accumulation routine.
* Use assembly code in the scramble routine.
* Since both the accumulation and scramble routines sit in the internal loop, convert the internal loop to an assembly version, and likewise avoid accessing the ACC array in memory frequently inside it.

Signed-off-by: Haojian Zhuang <[email protected]>
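To illustrate the "keep the accumulators out of memory" point, here is a minimal, hypothetical C sketch (not the PR's code; names and structure are illustrative) of an XXH3-style accumulation loop written with SVE intrinsics, assuming a vector length of at least 512 bits (e.g. A64FX) so that all 8 accumulator lanes fit in one register:

#include <arm_sve.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: keep the 8 accumulator lanes in a single Z register
 * across all stripes instead of reloading acc[] from memory each time.
 * Assumes an SVE vector length of at least 512 bits; endianness and
 * alignment details are omitted for brevity. */
static void accumulate_sve_sketch(uint64_t acc[8],
                                  const uint8_t *input,
                                  const uint8_t *secret,
                                  size_t nbStripes)
{
    const svbool_t pg = svwhilelt_b64_s32(0, 8);   /* first 8 lanes only */
    /* index vector {1,0,3,2,...}: swaps adjacent 64-bit lanes, like TBL in the asm */
    const svuint64_t swapIdx = sveor_n_u64_x(pg, svindex_u64(0, 1), 1);
    svuint64_t vacc = svld1_u64(pg, acc);          /* load accumulators once */
    size_t n;

    for (n = 0; n < nbStripes; n++) {
        svuint64_t data  = svld1_u64(pg, (const uint64_t *)(input  + 64 * n));
        svuint64_t key   = svld1_u64(pg, (const uint64_t *)(secret +  8 * n));
        svuint64_t mixed = sveor_u64_x(pg, data, key);
        svuint64_t swapped = svtbl_u64(data, swapIdx);
        svuint64_t lo = svand_n_u64_x(pg, mixed, 0xFFFFFFFF);
        svuint64_t hi = svlsr_n_u64_x(pg, mixed, 32);
        vacc = svadd_u64_x(pg, vacc, swapped);     /* acc += swap(data) */
        vacc = svmla_u64_x(pg, vacc, lo, hi);      /* acc += lo32 * hi32 */
    }
    svst1_u64(pg, acc, vacc);                      /* write back once at the end */
}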
Make bench tests support the assembly SVE routine.

With the SVE intrinsic implementation enabled, the build commands and performance data are below.

$export CPP_FLAGS="-DXXH_VECTOR=XXH_SVE"
$export CFLAGS="-O3 -march=armv8-a+sve -fPIC -DXXH_VECTOR=XXH_SVE"
$make

=== benchmarking 4 hash functions ===
benchmarking large inputs : from 512 bytes (log9) to 128 MB (log27)
xxh3   , 3679, 6019, 7807, 8945, 9862, 10343, 10622, 10604, 10782, 10697, 10763, 10900, 10913, 9959, 6374, 5979, 6057, 6076, 6108
XXH32  , 1326, 1440, 1495, 1523, 1534, 1541, 1545, 1534, 1505, 1506, 1507, 1506, 1508, 1456, 1248, 1195, 1199, 1201, 1200
XXH64  , 2510, 2803, 2978, 3072, 3121, 3139, 3155, 3127, 3051, 3046, 3059, 3060, 3059, 2899, 2117, 1983, 1991, 1993, 1991
XXH128 , 3421, 5791, 7501, 8891, 9787, 10363, 10646, 10435, 10809, 10935, 10974, 10999, 11002, 9916, 6099, 5773, 6110, 6109, 6119

With the SVE assembly implementation enabled, the build commands and performance data are below.

$export CPP_FLAGS="-DXXH_VECTOR=XXH_SVE"
$export CFLAGS="-O3 -march=armv8-a+sve -fPIC -DXXH_VECTOR=XXH_SVE"
$make DISPATCH=1

=== benchmarking 4 hash functions ===
benchmarking large inputs : from 512 bytes (log9) to 128 MB (log27)
xxh3   , 4142, 6663, 9745, 12327, 13990, 15064, 15631, 15515, 15412, 14055, 14105, 14135, 14126, 11953, 4585, 4000, 4013, 4042, 4033
XXH32  , 1326, 1440, 1495, 1523, 1535, 1543, 1547, 1536, 1500, 1503, 1503, 1502, 1485, 1452, 1243, 1192, 1199, 1199, 1197
XXH64  , 2499, 2760, 2975, 3071, 3122, 3137, 3153, 3133, 3041, 3044, 3015, 3051, 3030, 2897, 2124, 1977, 1988, 1989, 1967
XXH128 , 3903, 6454, 9485, 11954, 13807, 15135, 15891, 15381, 15376, 15442, 15678, 15677, 15728, 13096, 4698, 4132, 4046, 4044, 4051

Signed-off-by: Haojian Zhuang <[email protected]>
If an assembler source contains no GNU-stack note, the system by default assumes that an executable stack may be required. (GCC generates code to be executed on the stack when it implements a trampoline for nested functions.) This default behavior introduces a security issue. Signed-off-by: Haojian Zhuang <[email protected]>
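For reference, the conventional fix is to add an empty .note.GNU-stack section to the assembly source, which tells the linker that no executable stack is needed. This is an illustration, not necessarily the exact line used in this PR:

// Place at the end of the .S file: an empty GNU-stack note so the
// linker does not mark the program's stack as executable.
.section .note.GNU-stack, "", %progbits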
Rebase the patch set since the CI issue has been fixed.
This is a more complex PR; it will take some time to get through it. To begin with, we may have another opportunity to divide and conquer here, which would make each part easier to review. It seems there are 2 combined efforts that could be isolated:
Speaking of the assembly version: I was lucky enough to (briefly) get access to a server with SVE support, an AWS c7g instance. Nonetheless, this was a good opportunity to compare the intrinsics and assembly variants, and the difference was quite small, approximately +5% in favor of assembly. Now, this could be due to several factors. This matters because assembly introduces a substantial build difficulty, so it should be matched by some corresponding benefit. If +5% is about the right expectation, then mainline is probably not the best target for it (though it is still a good reason to create a specialized variant).
From a small amount of digging, c7g (AWS Graviton 3) seems to be based on the Neoverse V1, which is SVE-256. Looking at the optimization guide:
However, I question how much of the C performance is just compilers not being quite ready yet, and whether in a few months the entire code will be obsolete (especially since SVE-compatible machines are more readily available now).
So your guess is that the performance is mostly a matter of instruction ordering and pipelining?
Yes. If Graviton3 is an ARM design and not designed in-house, it is likely just as sensitive to instruction ordering and pipelining as the Cortexes I've been fiddling with.
As for the interleaved SVE, try replacing the sve256 loop block (L294 to L306) with this. Disclaimer: I haven't tested this and my ordering might not be ideal.

10:
// since vector length is known, avoid predicates for more freedom
ldr z19, [x1]
ldr z20, [x1, #1, mul vl]
ldr z21, [x2]
ldr z22, [x2, #1, mul vl]
prfd pldl1strm, p7, [x1, #31, mul vl]
eor z21.d, z21.d, z19.d
eor z22.d, z22.d, z20.d
tbl z23.d, {z19.d}, z7.d
tbl z24.d, {z20.d}, z7.d
lsr z25.d, z21.d, #32
and z21.d, z21.d, #0xFFFFFFFF
// Pretty sure this will forward to the add
mad z23.d, p7/m, z25.d, z21.d
// Encourage V1 usage
lsr z26.d, z22.d, #32
add z17.d, z17.d, z23.d
and z22.d, z22.d, #0xFFFFFFFF
mad z24.d, p7/m, z26.d, z22.d
add z18.d, z18.d, z24.d
add x1, x1, #64
add x2, x2, #8
add x10, x10, #1
cmp x10, x3
b.lt 10b
Unfortunately, my access to this platform was very temporary.
As a side note, I wonder if it is beneficial to just use NEON on SVE-128.
No biggie.
Let me check whether I can get an AWS c7g instance.
I mainly worked on the Fujitsu A64FX platform. Performance improved a lot with the assembly code. As I recall, A64FX is based on ARMv8.2, and AWS c7g is maybe based on ARMv8.4.
Yes, I mainly verified it on the Fujitsu A64FX.
By the way, I did some research on the A64FX, and from what it appears, NEON has been severely performance-deprecated (as in everything but the trivial instructions having 6-12 cycles of latency) to put more power into SVE. This is an understandable design choice because the chip is designed for specialized uses and not general purpose. So NEON being bad on this particular platform is unavoidable.
SVE doesn't work as expected; I'll keep investigating. My understanding is that SVE should work a bit better than NEON. @easyaspi314 There's some issue in your code snippet: "LSR <Zdn>.<T>, <Pg>/M, <Zdn>.<T>, <Zm>.<T>". If I switch from UXTW to an AND instruction, I need one more instruction, so I gain no benefit.
I think that for now we should only do SVE-512. Looking at the optimization guide, c7g is a tradeoff: while SVE can process 2x the data, NEON always has at least 2x the IPC (and NEON already has 256-bit loads). Also, scalar code can still be executed in the background, and we lose that benefit unless we mix in NEON, which would just be messy and not worth the complexity. However, with 100% optimal code, SVE will lose because the multiply is significantly slower and can't be parallelized:
(Parentheses are for when the result can't be forwarded to another add instruction, which isn't the case in XXH3.) The ARM pipeline is so fun 💀 SVE2 may have better performance because it can use a widening multiply-accumulate (UMLALB).
I assume that SVE is more sensitive to cache.
Will take a look, thanks.
There are some issues with 32-bit ARM that I want to work out, which would cause issues on the latter. A lot of it seems to be load-store issues. If I write XXH3_accumulate_512_neon in 32-bit ARM inline asm, I can get a significant performance boost, up to 12 GB/s vs 7 GB/s. However, this comes at a tradeoff in practicality: Google and Apple require all apps to have 64-bit support, so this will only really benefit Android devices that are at least 6 years old and armv7l Linux. I want the main target to be AArch64 and only go for low-footprint optimizations for 32-bit ARM.
Hopefully. It unfortunately seems like ARM has gone back to 128-bit SVE for its newest designs, including the V2. This might be because, as we see on c7g, the throughput penalty makes it not that much better than 2 NEON instructions.
c7g does not have SVE2. However, if you have access to an SVE2 machine with a known 128-bit vector size, try replacing XXH3_accumulate_512_neon with this. Also make sure to include the fixed-length typedefs:

typedef svuint64_t xxh_u64x2 __attribute__((arm_sve_vector_bits(128)));
typedef svuint32_t xxh_u32x4 __attribute__((arm_sve_vector_bits(128)));

XXH_FORCE_INLINE void XXH3_accumulate_512_neon(
    void *XXH_RESTRICT acc,
    const void *XXH_RESTRICT input,
    const void *XXH_RESTRICT secret
)
{
    size_t i;
    uint64x2_t *xacc = (uint64x2_t *) acc;
    const uint8_t *xinput = (const uint8_t *)input;
    const uint8_t *xsecret = (const uint8_t *)secret;

    XXH_ASSERT(svcntd() == 2);
    /* the last two lanes are handled with scalar code */
    for (i = 6; i < 8; i++) {
        XXH3_scalarRound(acc, input, secret, i);
    }
    for (i = 0; i < 3; i++) {
        uint64x2_t data_vec = XXH_vld1q_u64(xinput + 16 * i);
        uint64x2_t key_vec  = XXH_vld1q_u64(xsecret + 16 * i);
        uint64x2_t swapped  = vextq_u64(data_vec, data_vec, 1);
        uint64x2_t mixed_lo = veorq_u64(data_vec, key_vec);
        /* (x << 32) | (x >> 32) */
        uint32x4_t mixed_hi = vrev64q_u32(vreinterpretq_u32_u64(mixed_lo));
        uint64x2_t mul = (uint64x2_t)(xxh_u64x2)
            svmlalb_u64(
                (xxh_u64x2)swapped,
                (xxh_u32x4)mixed_lo,
                (xxh_u32x4)mixed_hi
            );
        xacc[i] = vaddq_u64(mul, xacc[i]);
    }
}

When compiling, use these flags (also try both GCC and Clang):
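An assumed, illustrative set of flags for a recent GCC or Clang (not taken from the original comment): the arm_sve_vector_bits(128) attribute used above only compiles when the SVE vector length is pinned at compile time, and the SVE2 intrinsics need an SVE2-capable -march.

$export CFLAGS="-O3 -march=armv9-a+sve2 -msve-vector-bits=128"
$make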
@easyaspi314 Although both NEON and SVE2 support a multiplication from two 32-bit inputs to a 64-bit result, they're totally different.
Yes, and the reason it is favorable is that, instead of requiring the low and high 32-bit halves to be extracted into separate vectors first, it reads the even-numbered 32-bit elements in place.
Sorry, I didn't get the point. After
After
But we hope
Ah, you are confused because the uzp trick is for two vectors at once; this is for only one. Come to think of it, this would actually have literally zero benefit over the two-vector approach, aside from a minor data dependency, for things that are going to be executed 4 at a time with identical timings (on the performance cores). I take back what I said: SVE-128 is garbage, just use NEON 🤪
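To make the difference discussed above concrete, here is a small illustrative comparison (not code from the PR; it requires a compiler targeting SVE2). NEON's widening multiply-accumulate pairs corresponding 32-bit elements of two narrow vectors, while SVE2's "bottom" form pairs the even-numbered 32-bit elements of full-width vectors:

#include <arm_neon.h>
#include <arm_sve.h>

/* Illustrative only: the two 32x32 -> 64 multiply-accumulate forms discussed. */
uint64x2_t neon_widening_mla(uint64x2_t acc, uint32x2_t a, uint32x2_t b)
{
    return vmlal_u32(acc, a, b);    /* acc[i] += (uint64_t)a[i] * b[i]     */
}

svuint64_t sve2_widening_mla(svuint64_t acc, svuint32_t a, svuint32_t b)
{
    return svmlalb_u64(acc, a, b);  /* acc[i] += (uint64_t)a[2*i] * b[2*i] */
}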
I tried to mix NEON and SVE2. At first, the compiler reported an error when I converted a NEON data type to an SVE type. Then I switched to assembly and met some strange errors while declaring
I plan to simplify the patch set. The dispatcher checks the CPU: if it supports SVE-512, it uses the assembly routine; if not, it falls back to the NEON routine. Would that be acceptable?
Since it's assembly, I only mix NEON and SVE2 for now; I'll mix scalar, NEON, and SVE2 later. With the help of SVE2, it saves one instruction and gains some performance (RED vs GREEN).
That difference might solely be from it being handwritten assembly. However, even if it isn't, I'd say that even interleaved with scalar it clearly isn't going to do enough to be worth an entirely different target from NEON. I would say that the minimum to warrant something like that would be 20%, and it should be on a newer target that will be worth even more in the future, not an older one (hence why I wouldn't want an entire ARMv7-A NEON inline assembly implementation, even though it benefits performance).
OK. How about SVE-512 assembly code? Could that be accepted?
I'd say yes, although I would recommend the following priority:
Also, dispatching on POSIX ELF would be trivial:

#include <sys/auxv.h>
#include <arm_sve.h>   /* for svcntd() */

__attribute__((target("+sve")))
static int XXH_isSVE512(void)
{
    /* svcntd() returns the number of 64-bit lanes: 8 lanes == 512-bit vectors */
    return svcntd() >= 8;
}

__attribute__((constructor))
static void XXH_featureTest(void)
{
    /* check HWCAP_SVE before executing any SVE instruction */
    if ((getauxval(AT_HWCAP) & HWCAP_SVE) != 0 && XXH_isSVE512()) {
        // SVE 512
    } else {
        // neon
    }
}
Agreed.
We haven't looked at this PR in a while. I believe it's fair to say that our current code base is not ready to add an assembly file at this point. Note that, in the future, we'll try to change the code base to support multiple files, which might open a door for assembly source code again, but that's for later. As for now, I note that this PR also introduced other interesting tools and capabilities.
I was wondering which parts of this PR could be salvaged. Another interesting technology presented in this PR is the runtime dispatcher for arm64. However, there is a non-negligible difference: the hardware detection is not limited to "just" detecting a CPU feature flag. Anyway, this will require a bit more work in order to be merged. At this point, I mostly wonder if this PR should remain open, for reference, with the idea that a future effort could restart from it.
Closing due to lack of activity. While the topic of providing an SVE implementation remains of interest, the current code base is not ready for it at this point.
With an assembly implementation, performance on SVE could continue to be improved.