
InstLatx64/InstLatX64_Demo


InstLatX64_Demo

A collection of some ideas in source form

GFNI_Demo.h

wrapper header for non-cryptographic use of the (V)GF2P8AFFINEQB instruction, in the style of Intel intrinsics:

  • emulating the missing byte-granularity shift and rotate instructions;
    _(mm|mm256|mm512)(|_mask|_maskz)_(srli|srl|srai|sra|slli|sll|ror|rol)_gfni_epi8
  • variable versions are also supported, using the GF2P8MULB instruction:
    _(mm|mm256|mm512)(|_mask|_maskz)_(srlv|sllv|rorv|rolv)_gfni_epi8
  • revbit, bit-broadcast, prefix-xor operations for bytes
    _(mm|mm256|mm512)(|_mask|_maskz)_(revbit|bcstbit|prefix_xor)_epi8
  • rotate, mirror, multiplication operations for 8x8 bit matrices
    _(mm|mm256|mm512)(|_mask|_maskz)_(mirror|rotate|multiplication)_8x8
  • auxiliary: the imm8 operand of (V)GF2P8AFFINEQB is XORed into every result byte, so it is useful, e.g., for inverting (bitwise NOT) the results of all the functions above, or for broadcasting a byte known at compile time without using GPRs, Port 5, or memory
    _(mm|mm256|mm512)(|_mask|_maskz)_(inverse|set1_gfni)_epi8
  • entire-register positional popcount (pospopcount), if AVX512_BITALG and AVX512_VPOPCNTDQ are also supported:
    _(mm|mm256|mm512)_pospopcount_(u8|u16)_(si128|si256|si512)_epi8
    _(mm|mm256|mm512)(|_mask|_maskz)_(tzcnt|lzcnt)_gfni_epi8
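All of the byte shifts and rotates above rest on one fact: (V)GF2P8AFFINEQB multiplies every byte by an arbitrary 8x8 bit matrix (one qword) and XORs in imm8. The sketch below is a scalar model of one byte lane, following the operation as written in the Intel SDM, plus hypothetical matrix builders for the fixed-count shifts and rotate. The helper names (`gf2p8affine_byte`, `srli_matrix`, etc.) are assumptions for illustration, not the header's API; it is meant for checking matrix constants, not as the header's implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one byte lane of (V)GF2P8AFFINEQB (Intel SDM):
   result.bit[i] = parity(A.byte[7 - i] & x) XOR imm8.bit[i] */
static uint8_t gf2p8affine_byte(uint8_t x, uint64_t A, uint8_t imm8)
{
    uint8_t r = 0;
    for (int i = 0; i < 8; i++) {
        uint8_t v = (uint8_t)((A >> (8 * (7 - i))) & x); /* matrix row for result bit i */
        int parity = 0;
        while (v) {                    /* clear the lowest set bit until none remain */
            parity ^= 1;
            v &= (uint8_t)(v - 1);
        }
        r |= (uint8_t)(((parity ^ (imm8 >> i)) & 1) << i);
    }
    return r;
}

/* Hypothetical helper: result bit i copies source bit src[i]; src[i] < 0 means 0. */
static uint64_t matrix_from_sources(const int src[8])
{
    uint64_t A = 0;
    for (int i = 0; i < 8; i++)
        if (src[i] >= 0)
            A |= (uint64_t)(1u << src[i]) << (8 * (7 - i));
    return A;
}

/* Hypothetical builders for fixed-count byte operations, 0 <= n < 8. */
static uint64_t srli_matrix(int n)   /* logical right shift */
{
    int src[8];
    assert(n >= 0 && n < 8);
    for (int i = 0; i < 8; i++) src[i] = (i + n < 8) ? i + n : -1;
    return matrix_from_sources(src);
}

static uint64_t srai_matrix(int n)   /* arithmetic right shift: replicate bit 7 */
{
    int src[8];
    assert(n >= 0 && n < 8);
    for (int i = 0; i < 8; i++) src[i] = (i + n < 8) ? i + n : 7;
    return matrix_from_sources(src);
}

static uint64_t slli_matrix(int n)   /* left shift */
{
    int src[8];
    assert(n >= 0 && n < 8);
    for (int i = 0; i < 8; i++) src[i] = (i >= n) ? i - n : -1;
    return matrix_from_sources(src);
}

static uint64_t rori_matrix(int n)   /* rotate right */
{
    int src[8];
    assert(n >= 0 && n < 8);
    for (int i = 0; i < 8; i++) src[i] = (i + n) & 7;
    return matrix_from_sources(src);
}
```

`srli_matrix(0)` yields the identity matrix 0x0102040810204080. On GFNI hardware the same constant would be broadcast to every qword, e.g. `_mm_gf2p8affine_epi64_epi8(x, _mm_set1_epi64x((long long)srli_matrix(3)), 0)` should shift every byte of `x` right by 3. An all-zero matrix makes the result equal to imm8, which is the compile-time byte broadcast trick; imm8 = 0xFF complements any result.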

VBMI2_Demo.h

wrapper header for the VPSHLDW/VPSHRDW/VPSHLDVW/VPSHRDVW instructions, substituting for the missing VPROLW/VPRORW/VPROLVW/VPRORVW instructions with the good old shld r1, r1 = rol r1 trick

    _(mm|mm256|mm512)(|_mask|_maskz)_(ror|rol)_vbmi2_epi16

wrapper header for emulating the missing byte-granularity shift and rotate instructions, including the variable versions

    _(mm|mm256|mm512)(|_mask|_maskz)_(slli|srli|srai|ror|rol)_vbmi2_epi8
    _(mm|mm256|mm512)(|_mask|_maskz)_(sllv|srlv|srav|rorv|rolv)_vbmi2_epi8
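The shld trick for 16-bit lanes can be checked with a scalar model of one word lane of VPSHLDW and VPSHRDW (operand order as in the Intel SDM). Passing the same value as both inputs turns the double-width shift into a rotate. The function names are hypothetical, chosen for this sketch only.

```c
#include <assert.h>
#include <stdint.h>

/* One word lane of VPSHLDW: shift the 32-bit concatenation a:b
   (a in the upper half) left by imm & 15, keep the upper 16 bits. */
static uint16_t shld16(uint16_t a, uint16_t b, unsigned imm)
{
    uint32_t cat = ((uint32_t)a << 16) | b;
    return (uint16_t)((cat << (imm & 15)) >> 16);
}

/* One word lane of VPSHRDW: shift b:a (b in the upper half)
   right by imm & 15, keep the lower 16 bits. */
static uint16_t shrd16(uint16_t a, uint16_t b, unsigned imm)
{
    uint32_t cat = ((uint32_t)b << 16) | a;
    return (uint16_t)(cat >> (imm & 15));
}

/* Same register twice: the bits shifted out re-enter at the other end. */
static uint16_t rol16_via_shld(uint16_t x, unsigned n) { return shld16(x, x, n); }
static uint16_t ror16_via_shrd(uint16_t x, unsigned n) { return shrd16(x, x, n); }
```

In vector form this corresponds to `_mm512_shldi_epi16(x, x, n)` and `_mm512_shrdi_epi16(x, x, n)`; the variable-count forms use `_mm512_shldv_epi16` / `_mm512_shrdv_epi16` in the same way.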

VPCLMULQDQ_Demo.h

experimental implementation of an entire-register (128/256/512-bit, xmm/ymm/zmm) prefix-XOR operation using the VPCLMULQDQ extension

    _mm_prefix_xor_clmul_si128(__m128i a);
    _mm256_prefix_xor_clmul_si256(__m256i a);
    _mm512_prefix_xor_clmul_si512(__m512i a);
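Why carry-less multiplication fits here: multiplying a qword by an all-ones constant in GF(2)[x] XORs together x, x<<1, x<<2, ..., which is exactly the per-qword prefix-XOR. For an entire register, the parity of everything below each qword must also be folded in. The scalar sketch below shows both steps for 128 bits; the function names are hypothetical, and it is not necessarily how VPCLMULQDQ_Demo.h propagates the carry between lanes.

```c
#include <assert.h>
#include <stdint.h>

/* Low 64 bits of the carry-less product a * b. */
static uint64_t clmul_lo64(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 64; i++)
        if ((b >> i) & 1)
            r ^= a << i;
    return r;
}

/* Prefix-XOR of one qword: bit i of the result = XOR of bits 0..i of x. */
static uint64_t prefix_xor64(uint64_t x)
{
    return clmul_lo64(x, ~0ULL);
}

/* Hypothetical 128-bit version (lo = bits 0..63, hi = bits 64..127): the high
   qword additionally needs the parity of the whole low qword in every bit. */
static void prefix_xor128(uint64_t lo, uint64_t hi, uint64_t *out_lo, uint64_t *out_hi)
{
    uint64_t plo = prefix_xor64(lo);
    uint64_t carry = 0 - (plo >> 63);   /* all ones if the low qword has odd parity */
    *out_lo = plo;
    *out_hi = prefix_xor64(hi) ^ carry;
}
```

With hardware support the per-qword step maps to `_mm_clmulepi64_si128(x, _mm_set1_epi64x(-1), 0x00)`, whose low 64 bits are the prefix-XOR of the low qword of `x`.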

Compiler_Intrinsic_Test.cpp

for testing the AVX512 intrinsic support of Visual Studio

TZCNT_Demo.cpp

Emulating the missing SIMD VPTZCNTB / VPTZCNTW / VPTZCNTD / VPTZCNTQ instructions
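One well-known identity for a per-lane trailing-zero count uses only a subtract, an and-not, and a popcount: `~x & (x - 1)` keeps exactly the zero bits below the lowest set bit. The scalar byte version below illustrates it; whether TZCNT_Demo.cpp uses this identity is not stated in the README, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

static unsigned popcount8(uint8_t v)
{
    unsigned c = 0;
    for (; v; v &= (uint8_t)(v - 1)) c++;   /* clear lowest set bit each step */
    return c;
}

/* ~x & (x - 1) has ones exactly at the trailing zeros of x,
   so its popcount is tzcnt(x); for x == 0 it is all ones, giving 8. */
static unsigned tzcnt8_via_popcnt(uint8_t x)
{
    return popcount8((uint8_t)(~x & (x - 1)));
}
```

Vectorized, the same expression would be roughly `_mm512_popcnt_epi8(_mm512_andnot_si512(x, _mm512_sub_epi8(x, _mm512_set1_epi8(1))))` (AVX512BW + AVX512_BITALG); word lanes use VPOPCNTW, dword/qword lanes VPOPCNTD/Q.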

LZCNT_Demo.cpp

Emulating the missing SIMD VPLZCNTB / VPLZCNTW instructions
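A common way to get a leading-zero count from a popcount is to "smear" the highest set bit downward with shifts and ORs; the bits left clear are then exactly the leading zeros. The scalar byte sketch below shows the idea; the names are hypothetical and it is not claimed to be LZCNT_Demo.cpp's method.

```c
#include <assert.h>
#include <stdint.h>

static unsigned popcount8(uint8_t v)
{
    unsigned c = 0;
    for (; v; v &= (uint8_t)(v - 1)) c++;
    return c;
}

/* After the three OR-shifts every bit below the highest set bit is set,
   so 8 - popcount is the number of leading zeros (8 for x == 0). */
static unsigned lzcnt8_via_smear(uint8_t x)
{
    x |= (uint8_t)(x >> 1);
    x |= (uint8_t)(x >> 2);
    x |= (uint8_t)(x >> 4);
    return 8u - popcount8(x);
}
```

In SIMD the byte shifts themselves are missing, so they would have to be built from 16-bit shifts plus masking; an alternative is widening to 32-bit lanes, applying `_mm512_lzcnt_epi32` (AVX512CD), and subtracting 24.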

PEXT_PDEP_Emu.cpp

Faster PEXT and PDEP emulation for AMD Excavator/Zen/Zen+/Zen2 based on Zach Wegner's ZP7 (Zach's Peppy Parallel-Prefix-Popcountin' PEXT/PDEP Polyfill)
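For reference, here is a plain bit-by-bit model of what PEXT and PDEP compute; any faster emulation (ZP7-style or otherwise) has to reproduce these results, so a loop like this is handy as a test oracle. It is not ZP7's parallel-prefix algorithm, and the names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* PEXT: gather the bits of x at the set positions of m into the low end. */
static uint64_t pext64_ref(uint64_t x, uint64_t m)
{
    uint64_t r = 0;
    for (uint64_t bit = 1; m; bit <<= 1) {
        uint64_t low = m & (0 - m);   /* lowest remaining mask bit */
        if (x & low) r |= bit;
        m &= m - 1;
    }
    return r;
}

/* PDEP: scatter the low bits of x to the set positions of m. */
static uint64_t pdep64_ref(uint64_t x, uint64_t m)
{
    uint64_t r = 0;
    for (uint64_t bit = 1; m; bit <<= 1) {
        uint64_t low = m & (0 - m);
        if (x & bit) r |= low;
        m &= m - 1;
    }
    return r;
}
```

The loops run once per set mask bit, and `pdep64_ref(pext64_ref(x, m), m) == (x & m)` holds for every input.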

CPU_Props.*

detection of CPU properties for dispatching code paths
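Dispatching on these extensions needs both the CPUID feature bits and confirmation (via XGETBV) that the OS saves the AVX/AVX-512 register state. The sketch below shows that pattern for a few features used in this repository, assuming GCC or Clang on x86 (`<cpuid.h>`, inline `xgetbv`); on MSVC `__cpuidex` and `_xgetbv` play the same role. The `cpu_features` struct and `detect_features` are hypothetical, not the layout of CPU_Props.*.

```c
#include <assert.h>
#include <stdbool.h>
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#define HAVE_GNU_X86 1
#endif

/* Hypothetical feature set for this sketch. */
typedef struct {
    bool gfni;          /* legacy-SSE form; VEX/EVEX forms need AVX/AVX-512 state */
    bool vpclmulqdq;    /* VEX/EVEX only, so gated on OS AVX support */
    bool avx512f, avx512bw, vbmi2, bitalg, vpopcntdq;
} cpu_features;

static cpu_features detect_features(void)
{
    cpu_features f = {0};
#ifdef HAVE_GNU_X86
    unsigned a, b, c, d, xcr0_lo, xcr0_hi;
    /* CPUID.1:ECX bit 27 (OSXSAVE) must be set before XGETBV may execute. */
    if (!__get_cpuid(1, &a, &b, &c, &d) || !(c & (1u << 27)))
        return f;
    __asm__ volatile("xgetbv" : "=a"(xcr0_lo), "=d"(xcr0_hi) : "c"(0));
    (void)xcr0_hi;
    bool os_avx = (xcr0_lo & 0x06u) == 0x06u;     /* XMM and YMM state */
    bool os_avx512 = (xcr0_lo & 0xE6u) == 0xE6u;  /* plus opmask and ZMM state */
    if (!__get_cpuid_count(7, 0, &a, &b, &c, &d))
        return f;
    f.gfni       = (c >> 8) & 1;
    f.vpclmulqdq = os_avx && ((c >> 10) & 1);
    f.avx512f    = os_avx512 && ((b >> 16) & 1);
    /* Sub-extensions are only trusted together with AVX512F. */
    f.avx512bw   = f.avx512f && ((b >> 30) & 1);
    f.vbmi2      = f.avx512f && ((c >> 6) & 1);
    f.bitalg     = f.avx512f && ((c >> 12) & 1);
    f.vpopcntdq  = f.avx512f && ((c >> 14) & 1);
#endif
    return f;
}
```

A caller would run `detect_features()` once at startup and select, for example, a GFNI/AVX-512 byte-shift path only when both `f.gfni` and `f.avx512bw` are true, falling back to scalar code otherwise.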

AVX512_DecimalPrint.*

AVX512F- and AVX512IFMA-based implementation of the _ultoa, _ltoa, _ui64toa, and _i64toa functions.
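SIMD units have no integer divide instruction, so vectorized number-to-string code generally replaces division by 10 with a multiplication by a fixed-point reciprocal, which is where wide integer multiplies such as those in AVX512IFMA come in. The scalar sketch below shows only that substitution for a 32-bit value; `u32toa_recip` is a hypothetical name, and the README does not say which constants or digit-grouping scheme AVX512_DecimalPrint uses.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Decimal conversion without a divide: for every 32-bit v,
   v / 10 == (v * 0xCCCCCCCD) >> 35, since 0xCCCCCCCD = ceil(2^35 / 10).
   buf must hold at least 11 bytes (10 digits + NUL). Returns buf. */
static char *u32toa_recip(uint32_t v, char *buf)
{
    char tmp[10];
    char *p = tmp + sizeof tmp;            /* digits come out last-to-first */
    assert(buf != NULL);
    do {
        uint32_t q = (uint32_t)(((uint64_t)v * 0xCCCCCCCDu) >> 35);  /* v / 10 */
        *--p = (char)('0' + (v - q * 10u));                          /* v % 10 */
        v = q;
    } while (v != 0);
    size_t len = (size_t)(tmp + sizeof tmp - p);
    memcpy(buf, p, len);
    buf[len] = '\0';
    return buf;
}
```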

AVX512_KMemDst.*

code for examining the effect of the k mask register value on EVEX-encoded instructions with a memory destination

Zen4_Demo.*

code for examining instructions on the AMD Zen4/Raphael CPU (CPUID A60F12), based on ideas from uops.info. Output example: \Results\Zen4_Demo_Imm8.txt

B2B_Demo.*

VPERMI2B-based code for fast any-to-any byte replacement. It can be useful, e.g., for tolower/toupper-style conversions or isxdigit/isalnum-style classifications.

[Figure: B2B_Demo performance results]
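With 512-bit registers, VPERMI2B indexes 128 bytes (two 64-byte tables) using the low 7 bits of each index byte, so a full 256-entry byte map needs two lookups plus a select on bit 7. The scalar model below shows that structure with an ASCII tolower table; all names are hypothetical, and B2B_Demo's actual table layout may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One byte of a 512-bit VPERMI2B: idx bits 6..0 pick one of 128 table bytes. */
static uint8_t permi2b_lookup(const uint8_t tab_lo[64], const uint8_t tab_hi[64], uint8_t idx)
{
    idx &= 0x7F;
    return idx < 64 ? tab_lo[idx] : tab_hi[idx - 64];
}

/* Hypothetical any-to-any map: 256-entry table split into two 128-byte
   halves; bit 7 of the input byte decides which lookup result is kept. */
static void b2b_map(const uint8_t map[256], const uint8_t *src, uint8_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint8_t lo = permi2b_lookup(map, map + 64, src[i]);         /* map[0..127]   */
        uint8_t hi = permi2b_lookup(map + 128, map + 192, src[i]);  /* map[128..255] */
        dst[i] = (src[i] & 0x80) ? hi : lo;
    }
}

/* Example table: ASCII tolower; every other byte maps to itself. */
static void make_tolower_map(uint8_t map[256])
{
    for (int c = 0; c < 256; c++)
        map[c] = (uint8_t)((c >= 'A' && c <= 'Z') ? c + ('a' - 'A') : c);
}
```

In vector code the bit-7 select would typically be a masked blend driven by `_mm512_movepi8_mask` of the input.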

AVX512_Reduce_Add.*

(DB)SAD based _mm512_reduce_add_epu8/16/32/64 implementation
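The SAD idea for bytes: (V)PSADBW against a zero vector sums each group of 8 unsigned bytes into its 64-bit lane, after which only 8 qwords remain to be added. The scalar model below shows the byte case; the names are hypothetical, and the 16/32/64-bit variants in AVX512_Reduce_Add.* are not modeled here.

```c
#include <assert.h>
#include <stdint.h>

/* One 64-bit lane of PSADBW with a zero operand: |b[j] - 0| summed over 8 bytes. */
static uint64_t sad_vs_zero_8(const uint8_t b[8])
{
    uint64_t s = 0;
    for (int j = 0; j < 8; j++) s += b[j];
    return s;
}

/* Hypothetical reduce_add over one ZMM worth of bytes: 8 SAD lanes,
   then a horizontal add of the 8 qword partial sums. */
static uint64_t reduce_add_epu8_model(const uint8_t v[64])
{
    uint64_t total = 0;
    for (int lane = 0; lane < 8; lane++)
        total += sad_vs_zero_8(v + 8 * lane);
    return total;
}
```

With intrinsics this would be roughly `_mm512_reduce_add_epi64(_mm512_sad_epu8(v, _mm512_setzero_si512()))`.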

AVX512_Saturated_AddSub.*

_mm512_(adds|subs)_(epi|epu)(32|64) implementation
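AVX-512 only provides saturating add/subtract for 8- and 16-bit elements, so dword/qword versions must detect overflow from a wrapping result. The scalar sketch below shows the standard sign-bit tests for 32-bit lanes (64-bit lanes work the same way); in SIMD code these bitwise combinations are commonly folded into VPTERNLOG operations. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Saturation value picked by the sign of a: INT32_MAX if a >= 0, else INT32_MIN. */
static uint32_t sat_for_sign(int32_t a)
{
    return ((uint32_t)a >> 31) + (uint32_t)INT32_MAX;
}

static int32_t adds_epi32_model(int32_t a, int32_t b)
{
    uint32_t r = (uint32_t)a + (uint32_t)b;   /* wrapping add */
    /* overflow iff a and b share a sign that r does not */
    uint32_t ovf = (((uint32_t)a ^ r) & ((uint32_t)b ^ r)) >> 31;
    return (int32_t)(ovf ? sat_for_sign(a) : r);
}

static int32_t subs_epi32_model(int32_t a, int32_t b)
{
    uint32_t r = (uint32_t)a - (uint32_t)b;   /* wrapping subtract */
    /* overflow iff a and b differ in sign and r's sign differs from a's */
    uint32_t ovf = (((uint32_t)a ^ (uint32_t)b) & ((uint32_t)a ^ r)) >> 31;
    return (int32_t)(ovf ? sat_for_sign(a) : r);
}

static uint32_t adds_epu32_model(uint32_t a, uint32_t b)
{
    uint32_t r = a + b;
    return r < a ? UINT32_MAX : r;            /* carry out: clamp to max */
}

static uint32_t subs_epu32_model(uint32_t a, uint32_t b)
{
    return a > b ? a - b : 0;                 /* borrow: clamp to 0 */
}
```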

FirstByte.*

Finding the first byte in lanes: _(mm256|mm512)_firstbyte_(epu32|epu64) implementation

HWBITPERM.*

SVE2 vector BITPERM (BEXT/BDEP/BGRP) emulation with HW scalar BMI2 PEXT/PDEP instructions
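On a 64-bit element, SVE2 BEXT and BDEP compute the same thing as BMI2 PEXT and PDEP, so they map directly. BGRP additionally packs the unselected bits, in order, above the selected ones, which two PEXTs and a shift reproduce. The sketch below models this with a bit-loop PEXT so it runs anywhere; on BMI2 hardware `_pext_u64` would replace `pext64_ref`. Names are hypothetical and this is not claimed to be HWBITPERM's code.

```c
#include <assert.h>
#include <stdint.h>

/* Bit-by-bit PEXT: gather bits of x at the set positions of m into the low end. */
static uint64_t pext64_ref(uint64_t x, uint64_t m)
{
    uint64_t r = 0;
    for (uint64_t bit = 1; m; bit <<= 1) {
        if (x & m & (0 - m)) r |= bit;   /* test x at the lowest remaining mask bit */
        m &= m - 1;
    }
    return r;
}

static unsigned popcount64(uint64_t v)
{
    unsigned c = 0;
    for (; v; v &= v - 1) c++;
    return c;
}

/* BGRP model: selected bits to the bottom, unselected bits right above them. */
static uint64_t bgrp64_model(uint64_t x, uint64_t m)
{
    unsigned k = popcount64(m);
    uint64_t sel = pext64_ref(x, m);
    uint64_t rest = pext64_ref(x, ~m);
    return k == 64 ? sel : sel | (rest << k);   /* avoid an undefined shift by 64 */
}
```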

AVX512_BGVSER.*

Byte-Granularity Variable Shift on Entire Register

    _(mm256|mm512)_(bsll|bsrl)_epi(256|512) [placeholder]
    _(mm256|mm512)_palign(l|r)_epi(256|512)
    _(mm256|mm512)_rotate(l|r)_epi(256|512)
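The README does not spell out argument order or edge cases for these functions, so the sketch below only fixes the intended semantics on a 64-byte (ZMM-sized) register image, with byte 0 as the least significant byte. The `_model` functions are hypothetical and are not the repository's implementation.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Byte shift right (toward index 0), zero fill; 0 <= n <= 64, dst must not overlap src. */
static void bsrl_512_model(const uint8_t src[64], unsigned n, uint8_t dst[64])
{
    assert(n <= 64);
    memset(dst, 0, 64);
    if (n < 64)
        memcpy(dst, src + n, 64 - n);
}

/* Byte shift left (toward index 63), zero fill. */
static void bsll_512_model(const uint8_t src[64], unsigned n, uint8_t dst[64])
{
    assert(n <= 64);
    memset(dst, 0, 64);
    if (n < 64)
        memcpy(dst + n, src, 64 - n);
}

/* Byte rotate right: byte i receives byte (i + n) mod 64. */
static void rotater_512_model(const uint8_t src[64], unsigned n, uint8_t dst[64])
{
    for (unsigned i = 0; i < 64; i++)
        dst[i] = src[(i + n) % 64];
}
```

With AVX512_VBMI, one possible vector realization is a single `_mm512_permutexvar_epi8` with index bytes `i + n` (masked to zero out-of-range lanes for the shifts), though the repository may use a different approach.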

AVX_VNNI_INT16_Saturated_AddSub.*

AVX_VNNI_INT16-based _(mm|mm256)_(adds|subs)_epi32 emulation proposal
