Skip to content

Up to 200x Faster Dot Products & Similarity Metrics — for Python, Rust, C, JS, and Swift, supporting f64, f32, f16 real & complex, i8, and bit vectors using SIMD for both AVX2, AVX-512, NEON, SVE, & SVE2 📐

License

Notifications You must be signed in to change notification settings

ashvardanian/SimSIMD

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

NumKong banner

Computing dot-products, similarity measures, and distances between low- and high-dimensional vectors is ubiquitous in Machine Learning, Scientific Computing, Geospatial Analysis, and Information Retrieval. These algorithms generally have linear complexity in time, constant or linear complexity in space, and are data-parallel. In other words, it is easily parallelizable and vectorizable and often available in packages like BLAS (level 1) and LAPACK, as well as higher-level numpy and scipy Python libraries. Ironically, even with decades of evolution in compilers and numerical computing, most libraries can be 3-200x slower than hardware potential even on the most popular hardware, like 64-bit x86 and Arm CPUs. Moreover, most lack mixed-precision support, which is crucial for modern AI! The rare few that support minimal mixed precision, run only on one platform, and are vendor-locked, by companies like Intel and Nvidia. NumKong provides an alternative. 1️⃣ NumKong functions are practically as fast as memcpy. 2️⃣ Unlike BLAS, most kernels are designed for mixed-precision and bit-level operations. 3️⃣ NumKong often ships more binaries than NumPy and has more backends than most BLAS implementations, and more high-level interfaces than most libraries.

Features

NumKong (Arabic: "سيمسيم دي") is a mixed-precision math library of over 350 SIMD-optimized kernels extensively used in AI, Search, and DBMS workloads. Named after the iconic "Open Sesame" command that opened doors to treasure in Ali Baba and the Forty Thieves, NumKong can help you 10x the cost-efficiency of your computational pipelines. Implemented distance functions include:

  • Euclidean (L2) and Cosine (Angular) spatial distances for Vector Search. docs
  • Dot-Products for real & complex vectors for DSP & Quantum computing. docs
  • Hamming (~ Manhattan) and Jaccard (~ Tanimoto) bit-level distances. docs
  • Set Intersections for Sparse Vectors and Text Analysis. docs
  • Mahalanobis distance and Quadratic forms for Scientific Computing. docs
  • Kullback-Leibler and Jensen–Shannon divergences for probability distributions. docs
  • Fused-Multiply-Add (FMA) and Weighted Sums to replace BLAS level 1 functions. docs
  • For Levenshtein, Needleman–Wunsch, and Smith-Waterman, check StringZilla.
  • 🔜 Haversine and Vincenty's formulae for Geospatial Analysis.

Moreover, NumKong...

  • handles float64, float32, float16, and bfloat16 real & complex vectors.
  • handles int8 integral, int4 sub-byte, and b8 binary vectors.
  • handles sparse uint32 and uint16 sets, and weighted sparse vectors.
  • is a zero-dependency header-only C 99 library.
  • has Python, Rust, JS, and Swift bindings.
  • has Arm backends for NEON, Scalable Vector Extensions (SVE), and SVE2.
  • has x86 backends for Haswell, Skylake, Ice Lake, Genoa, and Sapphire Rapids.
  • with both compile-time and runtime CPU feature detection easily integrates anywhere!

Due to the high-level of fragmentation of SIMD support in different x86 CPUs, NumKong generally uses the names of select Intel CPU generations for its backends. They, however, also work on AMD CPUs. Intel Haswell is compatible with AMD Zen 1/2/3, while AMD Genoa Zen 4 covers AVX-512 instructions added to Intel Skylake and Ice Lake. You can learn more about the technical implementation details in the following blog-posts:

Benchmarks

NumPy C 99 NumKong
cosine distances between 1536d vectors in int8
🚧 overflows
x86: 10,548,600 ops/s
arm: 11,379,300 ops/s
x86: 16,151,800 ops/s
arm: 13,524,000 ops/s
cosine distances between 1536d vectors in bfloat16
🚧 not supported
x86: 119,835 ops/s
arm: 403,909 ops/s
x86: 9,738,540 ops/s
arm: 4,881,900 ops/s
cosine distances between 1536d vectors in float16
x86: 40,481 ops/s
arm: 21,451 ops/s
x86: 501,310 ops/s
arm: 871,963 ops/s
x86: 7,627,600 ops/s
arm: 3,316,810 ops/s
cosine distances between 1536d vectors in float32
x86: 253,902 ops/s
arm: 46,394 ops/s
x86: 882,484 ops/s
arm: 399,661 ops/s
x86: 8,202,910 ops/s
arm: 3,400,620 ops/s
cosine distances between 1536d vectors in float64
x86: 212,421 ops/s
arm: 52,904 ops/s
x86: 839,301 ops/s
arm: 837,126 ops/s
x86: 1,538,530 ops/s
arm: 1,678,920 ops/s
euclidean distance between 1536d vectors in int8
x86: 252,113 ops/s
arm: 177,443 ops/s
x86: 6,690,110 ops/s
arm: 4,114,160 ops/s
x86: 18,989,000 ops/s
arm: 18,878,200 ops/s
euclidean distance between 1536d vectors in bfloat16
🚧 not supported
x86: 119,842 ops/s
arm: 1,049,230 ops/s
x86: 9,727,210 ops/s
arm: 4,233,420 ops/s
euclidean distance between 1536d vectors in float16
x86: 54,621 ops/s
arm: 71,793 ops/s
x86: 196,413 ops/s
arm: 911,370 ops/s
x86: 19,466,800 ops/s
arm: 3,522,760 ops/s
euclidean distance between 1536d vectors in float32
x86: 424,944 ops/s
arm: 292,629 ops/s
x86: 1,295,210 ops/s
arm: 1,055,940 ops/s
x86: 8,924,100 ops/s
arm: 3,602,650 ops/s
euclidean distance between 1536d vectors in float64
x86: 334,929 ops/s
arm: 237,505 ops/s
x86: 1,215,190 ops/s
arm: 905,782 ops/s
x86: 1,701,740 ops/s
arm: 1,735,840 ops/s

For benchmarks we mostly use 1536-dimensional vectors, like the embeddings produced by the OpenAI Ada API. The code was compiled with GCC 12, using glibc v2.35. The benchmarks performed on Arm-based Graviton3 AWS c7g instances and r7iz Intel Sapphire Rapids. Most modern Arm-based 64-bit CPUs will have similar relative speedups. Variance within x86 CPUs will be larger.

Similar speedups are often observed even when compared to BLAS and LAPACK libraries underlying most numerical computing libraries, including NumPy and SciPy in Python. Broader benchmarking results:

Algorithms & Design Decisions 📚

In general there are a few principles that NumKong follows:

  • Avoid loop unrolling.
  • Never allocate memory.
  • Never throw exceptions or set errno.
  • Keep all function arguments the size of the pointer.
  • Avoid returning from public interfaces, use out-arguments instead.
  • Don't over-optimize for old CPUs and single- and double-precision floating-point numbers.
  • Prioritize mixed-precision and integer operations, and new ISA extensions.
  • Prefer saturated arithmetic and avoid overflows.

Possibly, in the future:

  • Best effort computation silencing NaN components in low-precision inputs.
  • Detect overflows and report the distance with a "signaling" NaN.

Last, but not the least - don't build unless there is a demand for it. So if you have a specific use-case, please open an issue or a pull request, and ideally, bring in more users with similar needs.

Cosine Similarity, Reciprocal Square Root, and Newton-Raphson Iteration

The cosine similarity is the most common and straightforward metric used in machine learning and information retrieval. Interestingly, there are multiple ways to shoot yourself in the foot when computing it. The cosine similarity is the inverse of the cosine distance, which is the cosine of the angle between two vectors.

$$\text{CosineSimilarity}(a, b) = \frac{a \cdot b}{\|a\| \cdot \|b\|}$$ $$\text{CosineDistance}(a, b) = 1 - \frac{a \cdot b}{\|a\| \cdot \|b\|}$$

In NumPy terms, NumKong implementation is similar to:

import numpy as np

def cos_numpy(a: np.ndarray, b: np.ndarray) -> float:
    ab, a2, b2 = np.dot(a, b), np.dot(a, a), np.dot(b, b) # Fused in NumKong
    if a2 == 0 and b2 == 0: result = 0                    # Same in SciPy
    elif ab == 0: result = 1                              # Division by zero error in SciPy
    else: result = 1 - ab / (sqrt(a2) * sqrt(b2))         # Bigger rounding error in SciPy
    return result

In SciPy, however, the cosine distance is computed as 1 - ab / np.sqrt(a2 * b2). It handles the edge case of a zero and non-zero argument pair differently, resulting in a division by zero error. It's not only less efficient, but also less accurate, given how the reciprocal square roots are computed. The C standard library provides the sqrt function, which is generally very accurate, but slow. The rsqrt in-hardware implementations are faster, but have different accuracy characteristics.

  • SSE rsqrtps and AVX vrsqrtps: $1.5 \times 2^{-12}$ maximal relative error.
  • AVX-512 vrsqrt14pd instruction: $2^{-14}$ maximal relative error.
  • NEON frsqrte instruction has no documented error bounds, but can be $2^{-3}$.

To overcome the limitations of the rsqrt instruction, NumKong uses the Newton-Raphson iteration to refine the initial estimate for high-precision floating-point numbers. It can be defined as:

$$x_{n+1} = x_n \cdot (3 - x_n \cdot x_n) / 2$$

On 1536-dimensional inputs on Intel Sapphire Rapids CPU a single such iteration can result in a 2-3 orders of magnitude relative error reduction:

Datatype NumPy Error NumKong w/out Iteration NumKong
bfloat16 1.89e-08 ± 1.59e-08 3.07e-07 ± 3.09e-07 3.53e-09 ± 2.70e-09
float16 1.67e-02 ± 1.44e-02 2.68e-05 ± 1.95e-05 2.02e-05 ± 1.39e-05
float32 2.21e-08 ± 1.65e-08 3.47e-07 ± 3.49e-07 3.77e-09 ± 2.84e-09
float64 0.00e+00 ± 0.00e+00 3.80e-07 ± 4.50e-07 1.35e-11 ± 1.85e-11

Curved Spaces, Mahalanobis Distance, and Bilinear Quadratic Forms

The Mahalanobis distance is a generalization of the Euclidean distance, which takes into account the covariance of the data. It's very similar in its form to the bilinear form, which is a generalization of the dot product.

$$\text{BilinearForm}(a, b, M) = a^T M b$$ $$\text{Mahalanobis}(a, b, M) = \sqrt{(a - b)^T M^{-1} (a - b)}$$

Bilinear Forms can be seen as one of the most important linear algebraic operations, surprisingly missing in BLAS and LAPACK. They are versatile and appear in various domains:

  • In Quantum Mechanics, the expectation value of an observable $A$ in a state $\psi$ is given by $\langle \psi | A | \psi \rangle$, which is a bilinear form.
  • In Machine Learning, in Support Vector Machines (SVMs), bilinear forms define kernel functions that measure similarity between data points.
  • In Differential Geometry, the metric tensor, which defines distances and angles on a manifold, is a bilinear form on the tangent space.
  • In Economics, payoff functions in certain Game Theoretic problems can be modeled as bilinear forms of players' strategies.
  • In Physics, interactions between electric and magnetic fields can be expressed using bilinear forms.

Broad applications aside, the lack of a specialized primitive for bilinear forms in BLAS and LAPACK means significant performance overhead. A $vector * matrix * vector$ product is a scalar, whereas its constituent parts ($vector * matrix$ and $matrix * vector$) are vectors:

  • They need memory to be stored in: $O(n)$ allocation.
  • The data will be written to memory and read back, wasting CPU cycles.

NumKong doesn't produce intermediate vector results, like a @ M @ b, but computes the bilinear form directly.

Set Intersection, Galloping, and Binary Search

The set intersection operation is generally defined as the number of elements that are common between two sets, represented as sorted arrays of integers. The most common way to compute it is a linear scan:

size_t intersection_size(int *a, int *b, size_t n, size_t m) {
    size_t i = 0, j = 0, count = 0;
    while (i < n && j < m) {
        if (a[i] < b[j]) i++;
        else if (a[i] > b[j]) j++;
        else i++, j++, count++;
    }
    return count;
}

Alternatively, one can use the binary search to find the elements in the second array that are present in the first one. On every step the checked region of the second array is halved, which is called the galloping search. It's faster, but only when large arrays of very different sizes are intersected. Third approach is to use the SIMD instructions to compare multiple elements at once:

  • Using string-intersection instructions on x86, like pcmpestrm.
  • Using integer-intersection instructions in AVX-512, like vp2intersectd.
  • Using vanilla equality checks present in all SIMD instruction sets.

After benchmarking, the last approach was chosen, as it's the most flexible and often the fastest.

Complex Dot Products, Conjugate Dot Products, and Complex Numbers

Complex dot products are a generalization of the dot product to complex numbers. They are supported by most BLAS packages, but almost never in mixed precision. NumKong defines dot and vdot kernels as:

$$\text{dot}(a, b) = \sum_{i=0}^{n-1} a_i \cdot b_i$$ $$\text{vdot}(a, b) = \sum_{i=0}^{n-1} a_i \cdot \bar{b_i}$$

Where $\bar{b_i}$ is the complex conjugate of $b_i$. Putting that into Python code for scalar arrays:

def dot(a: List[number], b: List[number]) -> number:
    a_real, a_imaginary = a[0::2], a[1::2]
    b_real, b_imaginary = b[0::2], b[1::2]
    ab_real, ab_imaginary = 0, 0
    for ar, ai, br, bi in zip(a_real, a_imaginary, b_real, b_imaginary):
        ab_real += ar * br - ai * bi
        ab_imaginary += ar * bi + ai * br
    return ab_real, ab_imaginary

def vdot(a: List[number], b: List[number]) -> number:
    a_real, a_imaginary = a[0::2], a[1::2]
    b_real, b_imaginary = b[0::2], b[1::2]
    ab_real, ab_imaginary = 0, 0
    for ar, ai, br, bi in zip(a_real, a_imaginary, b_real, b_imaginary):
        ab_real += ar * br + ai * bi
        ab_imaginary += ar * bi - ai * br
    return ab_real, ab_imaginary

Logarithms in Kullback-Leibler & Jensen–Shannon Divergences

The Kullback-Leibler divergence is a measure of how one probability distribution diverges from a second, expected probability distribution. Jensen-Shannon divergence is a symmetrized and smoothed version of the Kullback-Leibler divergence, which can be used as a distance metric between probability distributions.

$$\text{KL}(P || Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)}$$ $$\text{JS}(P, Q) = \frac{1}{2} \text{KL}(P || M) + \frac{1}{2} \text{KL}(Q || M), M = \frac{P + Q}{2}$$

Both functions are defined for non-negative numbers, and the logarithm is a key part of their computation.

Mixed Precision in Fused-Multiply-Add and Weighted Sums

The Fused-Multiply-Add (FMA) operation is a single operation that combines element-wise multiplication and addition with different scaling factors. The Weighted Sum is its simplified variant without element-wise multiplication.

$$\text{FMA}_i(A, B, C, \alpha, \beta) = \alpha \cdot A_i \cdot B_i + \beta \cdot C_i$$ $$\text{WSum}_i(A, B, \alpha, \beta) = \alpha \cdot A_i + \beta \cdot B_i$$

In NumPy terms, the implementation may look like:

import numpy as np
def wsum(A: np.ndarray, B: np.ndarray, /, Alpha: float, Beta: float) -> np.ndarray:
    assert A.dtype == B.dtype, "Input types must match and affect the output style"
    return (Alpha * A + Beta * B).astype(A.dtype)
def fma(A: np.ndarray, B: np.ndarray, C: np.ndarray, /, Alpha: float, Beta: float) -> np.ndarray:
    assert A.dtype == B.dtype and A.dtype == C.dtype, "Input types must match and affect the output style"
    return (Alpha * A * B + Beta * C).astype(A.dtype)

The tricky part is implementing those operations in mixed precision, where the scaling factors are of different precision than the input and output vectors. NumKong uses double-precision floating-point scaling factors for any input and output precision, including i8 and u8 integers and f16 and bf16 floats. Depending on the generation of the CPU, given native support for f16 addition and multiplication, the f16 temporaries are used for i8 and u8 multiplication, scaling, and addition. For bf16, native support is generally limited to dot-products with subsequent partial accumulation, which is not enough for the FMA and WSum operations, so f32 is used as a temporary.

Auto-Vectorization & Loop Unrolling

On the Intel Sapphire Rapids platform, NumKong was benchmarked against auto-vectorized code using GCC 12. GCC handles single-precision float but might not be the best choice for int8 and _Float16 arrays, which have been part of the C language since 2011.

Kind GCC 12 f32 GCC 12 f16 NumKong f16 f16 improvement
Inner Product 3,810 K/s 192 K/s 5,990 K/s 31 x
Cosine Distance 3,280 K/s 336 K/s 6,880 K/s 20 x
Euclidean Distance ² 4,620 K/s 147 K/s 5,320 K/s 36 x
Jensen-Shannon Divergence 1,180 K/s 18 K/s 2,140 K/s 118 x

Dynamic Dispatch

Most popular software is precompiled and distributed with fairly conservative CPU optimizations, to ensure compatibility with older hardware. Database Management platforms, like ClickHouse, and Web Browsers, like Google Chrome,need to run on billions of devices, and they can't afford to be picky about the CPU features. For such users NumKong provides a dynamic dispatch mechanism, which selects the most advanced micro-kernel for the current CPU at runtime.

Subset F CD ER PF 4FMAPS 4VNNIW VPOPCNTDQ VL DQ BW IFMA VBMI VNNI BF16 VBMI2 BITALG VPCLMULQDQ GFNI VAES VP2INTERSECT FP16
Knights Landing (Xeon Phi x200, 2016) Yes Yes No
Knights Mill (Xeon Phi x205, 2017) Yes No
Skylake-SP, Skylake-X (2017) No No Yes No
Cannon Lake (2018) Yes No
Cascade Lake (2019) No Yes No
Cooper Lake (2020) Yes No
Ice Lake (2019) Yes No Yes No
Tiger Lake (2020) Yes No
Rocket Lake (2021) No
Alder Lake (2021) Partial Partial
Zen 4 (2022) Yes Yes No
Sapphire Rapids (2023) No Yes
Zen 5 (2024) Yes No

You can compile NumKong on an old CPU, like Intel Haswell, and run it on a new one, like AMD Genoa, and it will automatically use the most advanced instructions available. Reverse is also true, you can compile on a new CPU and run on an old one, and it will automatically fall back to the most basic instructions. Moreover, the very first time you prove for CPU capabilities with nk_capabilities(), it initializes the dynamic dispatch mechanism, and all subsequent calls will be faster and won't face race conditions in multi-threaded environments.

Target Specific Backends

NumKong exposes all kernels for all backends, and you can select the most advanced one for the current CPU without relying on built-in dispatch mechanisms. That's handy for testing and benchmarking, but also in case you want to dispatch a very specific kernel for a very specific CPU, bypassing NumKong assignment logic. All of the function names follow the same pattern: nk_{function}_{type}_{backend}.

  • The backend can be serial, haswell, skylake, ice, genoa, sapphire, turin, neon, or sve.
  • The type can be f64, f32, f16, bf16, f64c, f32c, f16c, bf16c, i8, or b8.
  • The function can be dot, vdot, cos, l2sq, hamming, jaccard, kl, js, or intersect.

To avoid hard-coding the backend, you can use the nk_kernel_punned_t to pun the function pointer and the nk_capabilities function to get the available backends at runtime. To match all the function names, consider a RegEx:

NK_PUBLIC void nk_\w+_\w+_\w+\(

On Linux, you can use the following command to list all unique functions:

$ grep -oP 'NK_PUBLIC void nk_\w+_\w+_\w+\(' include/numkong/*.h | sort | uniq
> include/numkong/binary.h:NK_PUBLIC void nk_hamming_b8_haswell(
> include/numkong/binary.h:NK_PUBLIC void nk_hamming_b8_ice(
> include/numkong/binary.h:NK_PUBLIC void nk_hamming_b8_neon(
> include/numkong/binary.h:NK_PUBLIC void nk_hamming_b8_serial(
> include/numkong/binary.h:NK_PUBLIC void nk_hamming_b8_sve(
> include/numkong/binary.h:NK_PUBLIC void nk_jaccard_b8_haswell(
> include/numkong/binary.h:NK_PUBLIC void nk_jaccard_b8_ice(
> include/numkong/binary.h:NK_PUBLIC void nk_jaccard_b8_neon(
> include/numkong/binary.h:NK_PUBLIC void nk_jaccard_b8_serial(
> include/numkong/binary.h:NK_PUBLIC void nk_jaccard_b8_sve(

License

Feel free to use the project under Apache 2.0 or the Three-clause BSD license at your preference.