Could you add some clarity? This PR: …

The bottleneck is probably the latency of the random lookups and the throughput of the shuffle port. The CPU can have 20+ loads "in flight" at once. The total number of loads issued per cycle is not that interesting for unrolling (we always have the same number of loads). It is mostly useful for scheduling instructions at the assembly level, and for ballparking whether we're IO bound (e.g. do we want to replace a lookup with a calculation, are we spilling and reloading registers, etc.).
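As a generic illustration of the "replace a lookup with a calculation" trade-off (a minimal sketch, not code from this PR; `hex_value_table` and both helpers are made up): the table version costs a load and some cache footprint per character, while the arithmetic version stays entirely in registers.

```cpp
#include <cstdint>

// hypothetical 256-entry table, assumed to be defined elsewhere
extern const uint8_t hex_value_table[256];

// lookup version: one load per character, plus the table's cache footprint
inline uint8_t hex_value_lookup(uint8_t c) {
  return hex_value_table[c];
}

// calculation version: a few register-only operations, no memory traffic
// (valid for the characters '0'-'9', 'a'-'f' and 'A'-'F')
inline uint8_t hex_value_compute(uint8_t c) {
  return static_cast<uint8_t>((c & 0x0F) + (c >> 6) * 9);
}
```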
Tables contain results for … according to uops from … But …

(PS: I'm actually new to all this simd stuff, so excuse me if I say something stupid)
These are the named (architectural) registers, but the CPU has many more physical registers. You can examine the issue experimentally...
The terms would be register renaming and out-of-order execution, I guess. The uops.info Code Analyzer is probably a good place to start, if we want to micro-optimize this.
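A toy sketch of why that matters for unrolling (generic example, not from this PR): with a single accumulator every addition waits on the previous one, while several independent accumulators give the out-of-order core separate dependency chains to interleave, and register renaming supplies the physical registers behind the extra names.

```cpp
#include <cstddef>
#include <cstdint>

// one accumulator: a single serial dependency chain
uint64_t sum_one_chain(const uint32_t* v, size_t n) {
  uint64_t s = 0;
  for (size_t i = 0; i < n; ++i) s += v[i];
  return s;
}

// four accumulators: four independent chains the CPU can overlap
uint64_t sum_four_chains(const uint32_t* v, size_t n) {
  uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
  size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    s0 += v[i];
    s1 += v[i + 1];
    s2 += v[i + 2];
    s3 += v[i + 3];
  }
  for (; i < n; ++i) s0 += v[i];  // tail
  return s0 + s1 + s2 + s3;
}
```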
I think I get it roughly, but I struggle to draw parallels with actual code in general. Fortunately, for this PR I'm not into tuning this exact implementation but rather introducing … I additionally checked … Tomorrow I'll try to find out whether building separately helps with …
It doesn't. But I found that …

UPD: I replaced the inlined functions with macros. Now everything looks much better. The unroll factors are likely not the most optimal; there were some better ones during development. I'm gonna write a script to brute-force them.
Replacing the inline functions in the loop body with macros might also help with msvc.
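To make the inline-function vs. macro difference concrete (purely illustrative; `write_block_fn` / `WRITE_BLOCK` are made-up stand-ins, not the PR's helpers): the macro pastes its body straight into the unrolled loop, so there is no inlining decision left for the compiler to get wrong.

```cpp
#include <emmintrin.h>

// inline-function form: msvc may still schedule or inline it suboptimally
static inline void write_block_fn(__m128i in, char*& out) {
  _mm_storeu_si128(reinterpret_cast<__m128i*>(out), in);
  out += 16;
}

// macro form: the body is textually expanded inside the loop
#define WRITE_BLOCK(in, out)                                   \
  do {                                                         \
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), (in));   \
    (out) += 16;                                               \
  } while (0)
```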
If we're willing to do 4 lookups per 16 bytes of input, then we'd only use 2 cache lines for tables. Note: I haven't actually studied the utf16->utf8 function, so I don't know what it is doing...
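For context, a generic sketch of what such in-register lookups look like (not the actual utf16->utf8 code; the table contents here are placeholders): each _mm_shuffle_epi8 is a 16-way lookup whose 16-byte table fits in an XMM register, so a handful of them covers all 16 input bytes with a tiny cache footprint.

```cpp
#include <emmintrin.h>   // SSE2
#include <tmmintrin.h>   // SSSE3: _mm_shuffle_epi8

// Classify 16 input bytes with two 16-entry tables, one indexed by the
// low nibble of each byte and one by the high nibble.
inline __m128i classify16(__m128i input) {
  const __m128i low_table  = _mm_setr_epi8(
      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1);  // placeholder values
  const __m128i high_table = _mm_setr_epi8(
      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1);  // placeholder values
  const __m128i nibble_mask = _mm_set1_epi8(0x0F);
  const __m128i lo = _mm_and_si128(input, nibble_mask);
  const __m128i hi = _mm_and_si128(_mm_srli_epi16(input, 4), nibble_mask);
  // two register-resident lookups, no memory access at all
  return _mm_and_si128(_mm_shuffle_epi8(low_table, lo),
                       _mm_shuffle_epi8(high_table, hi));
}
```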
Recently I've learned that all Intel processors since Sandy Bridge can do two _mm_loadu_si128 at the same time, using ports 2 and 3. So I tried 2 sequential _mm_loadu_si128 and it was a success. Then I also tried 4 and 8; 4 gave me an additional boost, but not 8.
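Roughly the shape of that experiment (a minimal sketch, not the PR's actual loop; process_block is a made-up stand-in for the real per-chunk conversion): the loads are independent, so both load ports can be kept busy.

```cpp
#include <emmintrin.h>
#include <cstddef>

void process_block(__m128i chunk);  // hypothetical per-16-byte work

void process_unrolled_x4(const char* buf, size_t len) {
  size_t i = 0;
  for (; i + 64 <= len; i += 64) {
    // four independent loads, no dependency chain between them
    const __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(buf + i));
    const __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(buf + i + 16));
    const __m128i c = _mm_loadu_si128(reinterpret_cast<const __m128i*>(buf + i + 32));
    const __m128i d = _mm_loadu_si128(reinterpret_cast<const __m128i*>(buf + i + 48));
    process_block(a);
    process_block(b);
    process_block(c);
    process_block(d);
  }
  // the remaining 0-63 bytes would go through a non-unrolled tail loop
}
```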
Alas, when I pulled the upstream commits, I got a significant performance penalty for the esperanto file with msvc. So I dropped them and started adding them back one by one, and I found which one causes it: 7761599 SSE UTF16 => latin1 (#311)

It seems there's nothing special here. It just added a new dependency on 2 other sse implementations. So I also checked with gcc and there was no penalty. Could it be a msvc bug?

inlined version
======================================================================
"the commit" is 7761599, current branch is sse_convert_latin1_to_utf8_perf
command: benchmark -P convert_latin1_to_utf8+westmere -F *.latin1.txt
arch: Sandy Bridge
======================================================================
windows 10
  msvc VS 17.5.5
  msvc VS 17.7.4
  LLVM(clang-cl) 16.0.5
  Intel(R) oneAPI DPC++/C++ Compiler 2023.2.0 (2023.2.0.20230627)
  mingw-w64-ucrt-x86_64-gcc 13.1.0-7: build error.
======================================================================
wsl2 ubuntu 22.04
  gcc 11.4.0
  clang 14.0.0-1ubuntu1.1
  Intel(R) oneAPI DPC++/C++ Compiler 2023.2.0 (2023.2.0.20230721)
The situation got even funnier when I removed all the loops except this one and got the opposite result (msvc VS 17.5.5). And it's quite consistent between benchmarks.
macros version
======================================================================
"the commit" is 7761599, current branch is sse_convert_latin1_to_utf8_perf
command: benchmark -P convert_latin1_to_utf8+westmere -F *.latin1.txt
arch: Sandy Bridge
======================================================================
windows 10
  msvc VS 17.7.4
  LLVM(clang-cl) 16.0.5
  Intel(R) oneAPI DPC++/C++ Compiler 2023.2.0 (2023.2.0.20230627)
  mingw-w64-ucrt-x86_64-gcc 13.1.0-7: build error.
======================================================================
wsl2 ubuntu 22.04
  gcc 11.4.0
  clang 14.0.0-1ubuntu1.1
  Intel(R) oneAPI DPC++/C++ Compiler 2023.2.0 (2023.2.0.20230721)
I'm going to continue the investigation in a couple of days.
plan:
* I suspect that building it as a shared lib might help, as it would prevent msvc from seeing the rest of the code. Supposedly, that wouldn't allow it to perform some smart optimisations, and thus results should be more stable.
* find out how the new dependency on the other sse implementations affects performance

For now, I suggest considering unrolling as unstable.