v7: ⬆️ 500 Kernels for Mixed Precision Numerics on CPUs #220

ashvardanian · 2024-10-31T23:12:22Z

It started as a straightforward optimization request from the @albumentations-team: to improve the special case of the wsum (Weighted Sum) operation for the "non-weighted" scenario and to add APIs for scalar multiplication and addition. This update introduces new public APIs in both C and Python:

scale: Implements $\alpha * A_i + \beta$
sum: Computes $A_i + B_i$
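
For reference, the semantics of the two operations can be sketched in plain Python. This is only an illustration of the formulas above, not the library's implementation; the names `scale_reference` and `sum_reference` are hypothetical and do not exist in the public API.

```python
from typing import List, Sequence


def scale_reference(a: Sequence[float], alpha: float, beta: float) -> List[float]:
    """Element-wise alpha * a[i] + beta (hypothetical reference helper)."""
    return [alpha * x + beta for x in a]


def sum_reference(a: Sequence[float], b: Sequence[float]) -> List[float]:
    """Element-wise a[i] + b[i] (hypothetical reference helper)."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return [x + y for x, y in zip(a, b)]


print(scale_reference([1.0, 2.0, 3.0], alpha=2.0, beta=0.5))  # [2.5, 4.5, 6.5]
print(sum_reference([1.0, 2.0], [10.0, 20.0]))                # [11.0, 22.0]
```

A reference like this is mostly useful as a baseline when checking the outputs of the vectorized kernels.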

Recognizing the value of consistency with widely-used libraries, we’ve also added "aliases" aligned with names familiar to developers using NumPy and OpenCV for element-wise addition and multiplication across vectors and scalars:

| NumPy | OpenCV | SimSIMD |
| --- | --- | --- |
| `np.add` | `cv.add` | `simd.add` |
| `np.multiply` | `cv.multiply` | `simd.multiply` |

Note: SimSIMD and NumPy differ in handling certain corner cases. SimSIMD offers broader support, with up to 64 tensor dimensions (compared to NumPy’s 32), wider compatibility with Python versions, operating systems, hardware, and numeric types—and of course, greater speed! However, SimSIMD requires input vectors to be of identical types. For integers, it also supports saturation to prevent overflow/underflow, which can simplify debugging but may be unexpected for some developers.
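
To make the saturation remark concrete, here is a minimal pure-Python sketch contrasting the two behaviors for 8-bit signed integers. The helper names are hypothetical, and the code models only the arithmetic, not any actual kernel: wrap-around is what fixed-width two's-complement addition produces, while saturation clamps to the representable range.

```python
I8_MIN, I8_MAX = -128, 127


def wrapping_add_i8(a: int, b: int) -> int:
    """Two's-complement wrap-around, as fixed-width integer addition behaves."""
    return ((a + b - I8_MIN) % 256) + I8_MIN


def saturating_add_i8(a: int, b: int) -> int:
    """Clamp the exact sum to the int8 range instead of wrapping."""
    return max(I8_MIN, min(I8_MAX, a + b))


print(wrapping_add_i8(100, 100))      # -56: overflow silently flips the sign
print(saturating_add_i8(100, 100))    # 127: pinned at the maximum
print(wrapping_add_i8(-100, -100))    # 56
print(saturating_add_i8(-100, -100))  # -128: pinned at the minimum
```

The saturated result is wrong by magnitude but never by sign, which is usually easier to spot in image or audio data than a wrapped value.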

The real excitement came when we realized that larger projects would take time to adopt emerging numeric types like bfloat16 and float8, which are well-known in AI circles. To bridge this gap, SimSIMD now introduces an AnyTensor type designed for maximum interoperability via CPython's Buffer Protocol and beyond, setting it apart from similar types in NumPy, PyTorch, TensorFlow, and JAX.
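
As background on the Buffer Protocol mentioned above, the sketch below uses only the standard library to show the metadata that any buffer-exporting object shares with a consumer: element format, item size, shape, and strides. It says nothing about the AnyTensor API itself; `describe_buffer` is a hypothetical helper written for this illustration.

```python
import array


def describe_buffer(obj) -> dict:
    """Collect the layout metadata exposed through the buffer protocol."""
    with memoryview(obj) as view:
        return {
            "format": view.format,
            "itemsize": view.itemsize,
            "shape": view.shape,
            "strides": view.strides,
        }


# A flat array of six 32-bit floats.
flat = array.array("f", [0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
print(describe_buffer(flat))

# The same memory viewed as a 2x3 matrix, without copying:
# cast to raw bytes first, then back to floats with a new shape.
matrix = memoryview(flat).cast("B").cast("f", shape=[2, 3])
print(describe_buffer(matrix))
```

Because the consumer only sees a pointer plus this metadata, the same memory can be handed between libraries without a copy, provided both sides agree on the element format.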

Tensor Class for C, Python, and Rust 🦀

Element-wise Operations 🧮

Reduced Range Trigonometry 📐

Geospatial Operations 🛰️

Breaking:

the project is renamed from "SimSIMD" to "MathKong", aligning with StringZilla.
"cosine" distance is now called "angular" to avoid confusion with trigonometric element-wise functions.
the flush_denormals function becomes configure_thread, which also enables AMX and SME.
DistancesTensor in Python is replaced with NDArray to match NumPy API.
kernels now have different output types instead of applying simsimd_distance_t uniformly.
complex products output complex types.
the ABI of element-wise operations changed: scaling factors are now taken by pointer.

If you have any feedback regarding the limitations of current array-processing software in single- or multi-node AI training settings, I am all ears 👂


ashvardanian changed the title ~~Element-wise BLAS-like APIs~~ Element-wise BLASAPIs & new Tensor for Python Nov 1, 2024

ashvardanian changed the title ~~Element-wise BLASAPIs & new Tensor for Python~~ Element-wise BLAS APIs & new Tensor for Python Nov 1, 2024

ashvardanian force-pushed the main-elementwise branch 2 times, most recently from 14fd5d3 to 18c41fd Compare November 5, 2024 22:32

ashvardanian added 10 commits November 6, 2024 00:32

Improve: Drop global_offset

4bfe1d1

This entry is largely unnecessary, and its computation in the linearization procedure depends on the value at the previous dim, making it hard to parallelize with SIMD.

Add: mdspan

ac5841f

Add: Type-casts to & from [iuf]64

4c69e7d

Break: Support mixed-type element-wise ops

4646d6b

Break: Shorter op-codes

383b799

Improve: Same type-casting as NumPy

08010ba

Add: i16 element-wise kernels for NEON

54bb07d

Add: i32 element-wise kernels for NEON

1f91b92

Add: i64 element-wise kernels for NEON

75993e7

Break: Shorter symbol names

38df49c

ashvardanian force-pushed the main-elementwise branch from 56f1a5d to 38df49c Compare November 8, 2024 15:33

ashvardanian added 8 commits November 8, 2024 15:38

Add: i16 element-wise kernels for Haswell

0e7c656

Add: i32 element-wise kernels for Haswell

e2698b0

Add: i8 element-wise kernels for Skylake

d10d27e

Add: i16 element-wise kernels for Skylake

8950a7e

Add: i32 element-wise kernels for Skylake

d1bb51c

Add: i64 element-wise kernels for Skylake

463e8f3

Improve: Unsigned type literals for masks

e089626

Add: Element-wise saturated addition for Ice Lake

09735ea

ashvardanian changed the title ~~Element-wise BLAS APIs & new Tensor for Python~~ Element-wise BLAS APIs & new Tensor for Python: ⬆️ 450 kernels Nov 9, 2024

ashvardanian added 6 commits November 9, 2024 15:55

Add: Dynamic dispatch for element-wise ops

602f812

Add: Missing serial integer wsum-s

3aac9ad

Fix: Match type-casting rules of NumPy

02236d1

Add: simsimd.multiply

48bd712

Fix: _mm256_adds_epi32 emulation

7aa118b

Fix: Serial emulation of _mm256_adds_epu32

8295e11

ashvardanian added 30 commits December 29, 2025 17:16

Add: Serial fallbacks for elementwise & reduction ops

c68ec46

Improve: Consolidate partial loads

ddf0097

Haswell was already pretty clean, but NEON and Skylake needed polishing

Improve: Neumaier scheme instead of Kahan

523cf24

Add: F8 reductions for Haswell & NEON variants

c64b227

Add: F8 elementwise ops for Haswell & NEON variants

696c66b

Add: Missing F8/F16 dispatch tables

608a621

Add: Half-precision mesh for Haswell & NEON

ef4f0f8

Add: Extensive ULP testing for all APIs

55f6b34

Improve: Time-constrained testing

a305b72

Fix: Stale SVE kernel names

17ea3a7

Chore: Split & extend Python binding

3a7ce06

Chore: Split & extend Rust binding

1037864

Improve: Test Python against ml_dtypes

7238049

Improve: Test with Cauchy distribution and F128

b9e52c6

Fix: Missing constants

a1ec738

Make: Compile with BLAS, MKL, & Accelerate

c93386b

Fix: avxvnni flag

04f0eaf

Improve: Reduce port 5 pressure

de76ebd

Fix: x86 compilation issues

dcb8332

Improve: Neumaier for f64 → f64 reductions

2a9a320

Improve: Use fewer bits for f8/f16/bf16 sums

3cad2d8

Improve: Reduce port 5 pressure

c8302e4

Add: Reduce on Skylake & Ice Lake

e50c4ee

Add: F64 KLD/JSD in AVX-512

91a2dc7

Fix: Iterative widening to 64 bits on Ice Lake

ae40d81

Docs: @copydoc for reductions

cfa6528

Add: Relevant Sierra kernels

1f1e25b

Add: Missing Haswell F64 GEMM

364e29b

Add: Missing b8 Haswell strided blend

957545d

Add: Half-precision trigonometry for SPR

c49c337
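
For readers unfamiliar with the "Neumaier scheme instead of Kahan" commits above, here is a minimal scalar sketch of both compensated-summation schemes in pure Python. It illustrates the general technique only and makes no claim about how the SIMD kernels in this PR are written; the function names are hypothetical.

```python
from typing import Iterable


def kahan_sum(values: Iterable[float]) -> float:
    """Classic Kahan summation: compensates for lost low-order bits."""
    total, compensation = 0.0, 0.0
    for x in values:
        y = x - compensation
        t = total + y
        compensation = (t - total) - y
        total = t
    return total


def neumaier_sum(values: Iterable[float]) -> float:
    """Neumaier's variant: also handles addends larger than the running sum."""
    total, compensation = 0.0, 0.0
    for x in values:
        t = total + x
        if abs(total) >= abs(x):
            compensation += (total - t) + x  # low-order bits of x were lost
        else:
            compensation += (x - t) + total  # low-order bits of total were lost
        total = t
    return total + compensation


data = [1.0, 1e100, 1.0, -1e100]
print(kahan_sum(data))     # 0.0: the two small addends are lost
print(neumaier_sum(data))  # 2.0: the exact result
```

The difference shows up when an addend is much larger than the running total, which is exactly the case Kahan's original formulation does not compensate for.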
