4 releases

Uses new Rust 2024

0.6.0	Dec 7, 2025
0.5.2	Dec 6, 2025
0.5.1	Dec 6, 2025
0.5.0	Dec 6, 2025

#459 in Math

MIT/Apache

29KB
317 lines

Crate to call the clmul instruction

Otherwise known as:

Carry-Less Multiplication
Carry-Less Multiply
Carry-Less Product

WARNING!

This crate has only been tested on Apple Silicon, please file a bug report if you encounter any problems.

TL;DR;

Its an operation that does a multiply of 2 integers using XOR to "add" the partial results.

Current version: 0.6.0

License: MIT OR Apache-2.0

Use

use clmul::clmul;

//clmul(a:u64,b:u64);
let res=clmul(4,u64:MAX);

To quote RISC-V Bitmanip Extension V0.90 23:

The classic applications for clmul are CRC [11, 22] and GCM, but more applications exist, including the following examples. There are obvious applications in hashing and pseudo random number generations. For exam- ple, it has been reported that hashes based on carry-less multiplications can outperform Google’s CityHash [15]. clmul of a number with itself inserts zeroes between each input bit. This can be useful for generating Morton code [21]. clmul of a number with -1 calculates the prefix XOR operation. This can be useful for decoding gray codes. Another application of XOR prefix sums calculated with clmul is branchless tracking of quoted strings in high-performance parsers. [14] Carry-less multiply can also be used to implement Erasure code efficiently. [12]

Wikipedia

https://en.wikipedia.org/wiki/CLMUL_instruction_set

Carry-less multiplication is polynomial multiplication, where each bit in a register is the coefficient of a polynomial over GF(2).

RiscV

RISC-V Bitmanip Extension V0.90 23

2.4 Carry-less multiply (clmul, clmulh, clmulr)

"That means clmulh is equivalent to clmulr followed by a 1-bit right shift. (The MSB of a clmulh result is always zero.) Another equivalent definition of clmulr is clmulr(a,b) := rev(clmul(rev(a), rev(b))). (The “r” in clmulr means reversed.)"

https://raw.githubusercontent.com/riscv/riscv-bitmanip/master/bitmanip-draft.pdf

LLVM

https://github.com/llvm-mirror/clang/blob/master/lib/Headers/__wmmintrin_pclmul.h#L26

RFC to include clmul in LLVM intrinsics

https://discourse.llvm.org/t/rfc-carry-less-multiplication-instruction/1347

CRC

http://www.ross.net/crc/download/crc_v3.txt

Barrett Reduction

https://www.corsix.org/content/barrett-reduction-polynomials

Intel

https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=clmul

Rust

Issue Implement CLMUL instruction set

https://github.com/rust-lang/stdarch/issues/318

Implemented in

core::core_arch::x86::pclmulqdq::_mm_clmulepi64_si128

pub unsafe fn _mm_clmulepi64_si128( a: __m128i, b: __m128i, const IMM8: i32 ) -> __m128i

Compiler flags

//--cfg polyval_armv8 //intel CMLMUL // On aarch64 targets including aarch64-apple-darwin (Apple M1) and Linux targets such as aarch64-unknown-linux-gnu and aarch64-unknown-linux-musl, support for using the PMULL instructions in ARMv8’s Cryptography Extensions with the following RUSTFLAGS: // // --cfg polyval_armv8 // // On Linux and macOS when the ARMv8 features are enabled, support for PMULL // intrinsics is autodetected at runtime. On other platforms the crypto target feature must be enabled via RUSTFLAGS.

    // RUSTFLAGS="-Ctarget-cpu=sandybridge" cargo bench
    //core::arch::aarch64::poly8x8_t

Benchmarks

https://stackoverflow.com/questions/39490345/interleave-bits-efficiently

CLMUL is great on Broadwell and later (1 uop, 1c throughput), or 3 uops (2c throughput) for the YMM/ZMM version on Ice Lake. Then down to 1 uop again for the YMM/ZMM version on Alder Lake. On AMD, it's 4 uops (2c throughput not fully pipelined) for the XMM version on Zen1/2/3 (and the YMM version which is new in Zen3). uops.info. But pdep is disastrously slow on Zen 1/2, only having proper hw support in Zen3. –

https://encode.su/threads/3760-clmul-instruction-for-interleaving-bits-(For-improve-cache-speed)

CLMUL is 6-7clk, so I doubt that it could be useful as a hash function.

https://stackoverflow.com/questions/53401547/is-clmul-constant-time

Numbers from Agner Fog's tables

On Intel since Broadwell, pclmuludq is 1 uop. On Skylake, it's 7 cycle latency, 1 per clock throughput. (So you need to keep 7 independent PCLMUL operations in flight to saturate the execution unit on port 5). Broadwell has 5 cycle latency. With a memory source operand, it's 1 extra uop.

On Haswell, it's 3 uops (2p0 p5) with 7 cycle latency and one per 2 clock throughput.

On Sandybridge/IvyBridge it's 18 uops, 14c latency, one per 8 clock throughput.

On Westmere (2nd Gen Nehalem) it's 12c latency, one per 8c throughput. (Unknown number of uops, neither Agner Fog nor uops.info has it. But we can safely assume it's microcoded.) This was the first generation to support the instruction- one of the very few differences from Nehalem to Westmere.

On Ryzen it's 4 uops, 4c latency, one per 2 clock throughput. http://instlatx64.atw.hu/ shows it 4.5 cycle latency. I'm not sure what the difference is between their testing and Agner's.

On Piledriver it's 5 uops, 12c latency, one per 7 clock throughput.

On Jaguar it's 1 uop, 3c latency, one per 1 clock throughput!

On Silvermont it's 8 uops, 10c latency/throughput. Goldmont = 3 uops, 6c lat / 3c tput.

Dependencies

~37KB