Collected source form of some ideas
wrapper header for non-cryptographic use of the (V)GF2P8AFFINEQB instruction in the style of Intel intrinsics:
- emulating the missing byte-granularity shift and rotate instructions:
_(mm|mm256|mm512)(|_mask|_maskz)_(srli|srl|srai|sra|slli|sll|ror|rol)_gfni_epi8
- variable versions also supported with GF2P8MULB instruction:
_(mm|mm256|mm512)(|_mask|_maskz)_(srlv|sllv|rorv|rolv)_gfni_epi8
- revbit, bit-broadcast, prefix-xor operations for bytes
_(mm|mm256|mm512)(|_mask|_maskz)_(revbit|bcstbit|prefix_xor)_epi8
- rotate, mirror, multiplication operations for 8x8 bit matrices
_(mm|mm256|mm512)(|_mask|_maskz)_(mirror|rotate|multiplication)_8x8
- auxiliary: the imm8 operand of (V)GF2P8AFFINEQB is XORed into the result bytes, so it is useful e.g. for inverting all of the above functions, or for broadcasting a byte known at compile time without using GPRs, Port5 or memory
_(mm|mm256|mm512)(|_mask|_maskz)_(inverse|set1_gfni)_epi8
- entire register pospopcount (if AVX512_BITALG & AVX512_VPOPCNTDQ are also supported):
_(mm|mm256|mm512)_pospopcount_(u8|u16)_(si128|si256|si512)_epi8
- tzcnt, lzcnt for bytes (idea of https://gist.github.com/animetosho/6cb732ccb5ecd86675ca0a442b3c0622)
_(mm|mm256|mm512)(|_mask|_maskz)_(tzcnt|lzcnt)_gfni_epi8
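The byte-level operations listed above all come down to choosing an 8x8 bit matrix (and optionally an imm8) for the affine instruction. The following is a minimal scalar sketch of one byte lane of (V)GF2P8AFFINEQB, written from the documented bit semantics, not taken from the header itself. The function name `gf2p8affine_byte` and the `GFNI_*` macros are hypothetical names chosen for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one byte lane of (V)GF2P8AFFINEQB:
   result bit i = parity(matrix byte [7-i] AND x) XOR imm8 bit i.
   Hypothetical helper name, for illustration only. */
uint8_t gf2p8affine_byte(uint64_t matrix, uint8_t x, uint8_t imm8) {
    uint8_t r = 0;
    for (int i = 0; i < 8; i++) {
        uint8_t t = (uint8_t)(matrix >> (8 * (7 - i))) & x;
        t ^= t >> 4;                  /* fold the byte down ...        */
        t ^= t >> 2;
        t ^= t >> 1;                  /* ... so bit 0 holds the parity */
        r |= (uint8_t)((t & 1u) << i);
    }
    return (uint8_t)(r ^ imm8);
}

/* Matrices in the qword layout the instruction expects. */
#define GFNI_IDENTITY 0x0102040810204080ULL
#define GFNI_REVBIT   0x8040201008040201ULL
/* Shifting the identity matrix by whole bytes shifts the bits of each byte. */
#define GFNI_SLLI(n)  (GFNI_IDENTITY >> (8 * (n)))   /* n in 0..7 */
#define GFNI_SRLI(n)  (GFNI_IDENTITY << (8 * (n)))   /* n in 0..7 */
```

With this model, `gf2p8affine_byte(GFNI_IDENTITY, x, 0xFF)` inverts a byte, and a zero matrix with imm8 = c yields the constant c for any input, which is the imm8 trick described in the auxiliary item.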
wrapper header for the VPSHLDW/VPSHRDW/VPSHLDVW/VPSHRDVW instructions, substituting for the missing VPROLW/VPRORW/VPROLVW/VPRORVW instructions with the good old "shld r1, r1 = rol r1" trick
_(mm|mm256|mm512)(|_mask|_maskz)_(ror|rol)_vbmi2_epi16
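The "shld r1, r1 = rol r1" trick can be checked with a scalar model of a single 16-bit lane. This is only a sketch of the idea under the usual funnel-shift semantics (concatenate two words, shift, keep one half); `shld16` and `shrd16` are hypothetical helper names, not functions from the header.

```c
#include <assert.h>
#include <stdint.h>

/* Funnel shift left: concatenate hi:lo, shift left by n (mod 16),
   keep the upper 16 bits. Models one word lane of VPSHLDW. */
uint16_t shld16(uint16_t hi, uint16_t lo, unsigned n) {
    uint32_t t = ((uint32_t)hi << 16) | lo;
    return (uint16_t)((t << (n & 15)) >> 16);
}

/* Funnel shift right: concatenate hi:lo, shift right by n (mod 16),
   keep the lower 16 bits. Models one word lane of VPSHRDW. */
uint16_t shrd16(uint16_t lo, uint16_t hi, unsigned n) {
    uint32_t t = ((uint32_t)hi << 16) | lo;
    return (uint16_t)(t >> (n & 15));
}

/* Feeding the same value into both operands turns the funnel shift
   into a rotate. */
uint16_t rol16(uint16_t x, unsigned n) { return shld16(x, x, n); }
uint16_t ror16(uint16_t x, unsigned n) { return shrd16(x, x, n); }
```

For example, `rol16(0x8001, 1)` gives `0x0003` and `ror16(0x8001, 1)` gives `0xC000`.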
wrapper header for emulating the missing byte-granularity shift and rotate instructions, including variable versions
_(mm|mm256|mm512)(|_mask|_maskz)_(slli|srli|srai|ror|rol)_vbmi2_epi8
_(mm|mm256|mm512)(|_mask|_maskz)_(sllv|srlv|srav|rorv|rolv)_vbmi2_epi8
experimental implementation of entire register (128/256/512b, xmm/ymm/zmm) prefix-xor operation with the VPCLMULQDQ extension
_mm_prefix_xor_clmul_si128(__m128i a);
_mm256_prefix_xor_clmul_si256(__m256i a);
_mm512_prefix_xor_clmul_si512(__m512i a);
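The link between prefix-xor and carry-less multiplication is that multiplying by an all-ones value XORs together every shifted copy of the input. The following scalar sketch shows this on a single 64-bit word only; it does not model how the carry is propagated across the 64-bit halves of a full xmm/ymm/zmm register. The function names are hypothetical, and the software `clmul_lo64` merely stands in for the hardware instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Low 64 bits of the carry-less (GF(2) polynomial) product a * b.
   Software stand-in for the carry-less multiply instruction. */
uint64_t clmul_lo64(uint64_t a, uint64_t b) {
    uint64_t r = 0;
    for (int i = 0; i < 64; i++)
        if ((b >> i) & 1)
            r ^= a << i;
    return r;
}

/* Prefix-xor: result bit i = XOR of input bits 0..i. */
uint64_t prefix_xor_clmul(uint64_t x) {
    return clmul_lo64(x, ~0ULL);
}

/* Reference: the classic log-step shift/xor ladder. */
uint64_t prefix_xor_shift(uint64_t x) {
    x ^= x << 1;
    x ^= x << 2;
    x ^= x << 4;
    x ^= x << 8;
    x ^= x << 16;
    x ^= x << 32;
    return x;
}
```

For the input `0x12` (bits 1 and 4 set) both versions return `0x0E`: the first set bit switches the running parity on and the second switches it off again.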
for testing Visual Studio AVX512 capabilities
Emulating the missing SIMD VPTZCNTB / VPTZCNTW / VPTZCNTD / VPTZCNTQ instructions
Emulating the missing SIMD VPLZCNTB / VPLZCNTW instructions
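Both counts can be reduced to a population count, which is what makes them easy to vectorise. The sketch below shows the identities for a single byte in scalar C. It illustrates the general idea (for tzcnt, the ANDN form `~x & (x - 1)`), not necessarily the exact instruction sequence used in these headers, and all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Plain bit count of one byte. */
unsigned popcnt8(uint8_t x) {
    unsigned c = 0;
    while (x) {
        x &= (uint8_t)(x - 1);   /* clear the lowest set bit */
        c++;
    }
    return c;
}

/* ~x & (x - 1) keeps exactly the zero bits below the lowest set bit,
   so counting them gives tzcnt. x == 0 yields 8. */
unsigned tzcnt8(uint8_t x) {
    return popcnt8((uint8_t)(~x & (x - 1)));
}

/* Smear the highest set bit downwards; the bits still clear afterwards
   are the leading zeros. x == 0 yields 8. */
unsigned lzcnt8(uint8_t x) {
    x |= (uint8_t)(x >> 1);
    x |= (uint8_t)(x >> 2);
    x |= (uint8_t)(x >> 4);
    return popcnt8((uint8_t)~x);
}
```

For `0x28` (binary 0010 1000) this gives `tzcnt8 = 3` and `lzcnt8 = 2`.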
Faster PEXT and PDEP emulation for AMD Excavator/Zen/Zen+/Zen2 based on Zach Wegner's ZP7 (Zach's Peppy Parallel-Prefix-Popcountin' PEXT/PDEP Polyfill)
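For readers unfamiliar with the two operations being emulated, here is a plain bit-by-bit reference in scalar C. It is deliberately the slow, obvious version and is not ZP7's parallel-prefix algorithm; `pext64_ref` and `pdep64_ref` are hypothetical names for this sketch. A reference like this is mainly useful for checking a faster polyfill against.

```c
#include <assert.h>
#include <stdint.h>

/* PEXT: gather the bits of src selected by mask into the low bits
   of the result, preserving their order. */
uint64_t pext64_ref(uint64_t src, uint64_t mask) {
    uint64_t r = 0;
    for (uint64_t bit = 1; mask != 0; mask &= mask - 1, bit <<= 1) {
        uint64_t lowest = mask & (~mask + 1);   /* lowest set mask bit */
        if (src & lowest)
            r |= bit;
    }
    return r;
}

/* PDEP: scatter the low bits of src to the positions selected by mask. */
uint64_t pdep64_ref(uint64_t src, uint64_t mask) {
    uint64_t r = 0;
    for (uint64_t bit = 1; mask != 0; mask &= mask - 1, bit <<= 1) {
        uint64_t lowest = mask & (~mask + 1);   /* lowest set mask bit */
        if (src & bit)
            r |= lowest;
    }
    return r;
}
```

For example, `pext64_ref(0x12345678, 0xFF00FF00)` returns `0x1256`, and `pdep64_ref(0x1256, 0xFF00FF00)` returns `0x12005600`.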
detection of CPU properties for dispatching code paths
AVX512F, AVX512IFMA based implementation of _ultoa, _ltoa, _ui64toa, _i64toa functions.
code for examining the effect of the k mask register value on EVEX-encoded instructions with a memory destination
code for examining instructions on the AMD Zen4/Raphael CPU (CPUID A60F12). It is based on ideas from uops.info. Output example: \Results\Zen4_Demo_Imm8.txt
VPERMI2B based code for fast any-to-any byte replacement. It can be useful e.g. for tolower/toupper type conversions or isxdigit/isalnum type classifications. Performance results:
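The replacement idea can be pictured as a 128-entry lookup table: two 64-byte table registers, with the low 7 bits of each input byte used as the index. The scalar sketch below models that under one assumption that is mine and not stated above: bytes with the top bit set are passed through unchanged. The names `byte_replace` and `make_tolower_table` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill a 128-entry table that maps ASCII 'A'..'Z' to 'a'..'z'
   and every other 7-bit value to itself. */
void make_tolower_table(uint8_t table[128]) {
    for (unsigned i = 0; i < 128; i++)
        table[i] = (uint8_t)((i >= 'A' && i <= 'Z') ? i + ('a' - 'A') : i);
}

/* Scalar model of a table-driven any-to-any byte replacement.
   Only the low 7 bits index the table; bytes >= 0x80 are kept as they
   are (an assumption of this sketch). */
void byte_replace(uint8_t *buf, size_t n, const uint8_t table[128]) {
    for (size_t i = 0; i < n; i++) {
        uint8_t c = buf[i];
        uint8_t mapped = table[c & 0x7F];
        buf[i] = (c & 0x80) ? c : mapped;
    }
}
```

A classification such as isxdigit fits the same shape: the table holds 0 or 1 (or a mask value) per entry instead of a replacement character.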
(DB)SAD based _mm512_reduce_add_epu8/16/32/64 implementation
_mm512_adds/subs_epi/epu/32/64 implementation
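Saturating 32-bit arithmetic can be built from wrapping arithmetic plus a sign-bit overflow test, which maps naturally onto bitwise vector operations and a blend. The scalar sketch below shows one common formulation for a single element; it is not necessarily the sequence this implementation uses, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Signed saturating add. Overflow happened iff both inputs have the same
   sign and the wrapped sum has a different one. */
int32_t adds_i32(int32_t a, int32_t b) {
    uint32_t ua = (uint32_t)a, ub = (uint32_t)b;
    uint32_t sum = ua + ub;                       /* wraps modulo 2^32 */
    uint32_t ovf = ~(ua ^ ub) & (ua ^ sum);       /* sign bit set on overflow */
    uint32_t sat = (ua >> 31) + 0x7FFFFFFFu;      /* INT32_MAX or INT32_MIN bits */
    /* Conversion back to int32_t assumes two's complement. */
    return (int32_t)((ovf >> 31) ? sat : sum);
}

/* Signed saturating subtract. Overflow happened iff the inputs differ in
   sign and the wrapped difference has a different sign than a. */
int32_t subs_i32(int32_t a, int32_t b) {
    uint32_t ua = (uint32_t)a, ub = (uint32_t)b;
    uint32_t diff = ua - ub;
    uint32_t ovf = (ua ^ ub) & (ua ^ diff);
    uint32_t sat = (ua >> 31) + 0x7FFFFFFFu;
    return (int32_t)((ovf >> 31) ? sat : diff);
}

/* Unsigned saturating add: a wrapped sum is smaller than either input. */
uint32_t adds_u32(uint32_t a, uint32_t b) {
    uint32_t sum = a + b;
    return (sum < a) ? UINT32_MAX : sum;
}
```

In both signed cases the saturation value depends only on the sign of `a`: a non-negative `a` can only overflow upwards, a negative `a` only downwards.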
Finding first byte in lanes _mm256|512_firstbyte_epu32/64 implementation
SVE2 vector BITPERM (BEXT/BDEP/BGRP) emulation with HW scalar BMI2 PEXT/PDEP instructions
Byte-Granularity Variable Shift on Entire Register
_(mm256|mm512)_(bsll|bsrl)_epi(256|512) [placeholder]
_(mm256|mm512)_palign(l|r)_epi(256|512)
_(mm256|mm512)_rotate(l|r)_epi(256|512)
AVX_VNNI_INT16 based (mm|mm256)(adds|subs)_epi32 emulation proposal
- Geoff Langdale Why Ice Lake is Important (a bit-basher’s perspective)
- Marcus D. R. Klarqvist, Wojciech Muła, Daniel Lemire Efficient Computation of Positional Population Counts Using SIMD Instructions
- Wojciech Muła AVX512VBMI — remove spaces from text
- Zach Wegner ZP7 (Zach's Peppy Parallel-Prefix-Popcountin' PEXT/PDEP Polyfill)
- Abel, Andreas and Reineke, Jan uops.info: Characterizing Latency, Throughput, and Port Usage of Instructions on Intel Microarchitectures
- PerforatedBlob TZCNT - TERNLOG->ANDN
- TravisDowns Scalar/HW GPR PDEP/PEXT reference code
- Daniel Lemire Converting integers to decimal strings faster with AVX-512
- KMemDst results: Intel SKX/CNL/TGL/RKL/ADL, AMD RPH
- Geoff Langdale's Byte2Byte question
- Geoff Langdale's reduce_add inspiration
- A list of “out-of-band” uses for the GF2P8AFFINEQB instruction I haven't seen documented elsewhere: idea of tzcnt/lzcnt_gfni_epi8, sllv/srlv_gfni_epi8
- FirstByte inspiration
- Robert Clausecker BGVSER inspiration