Computing dot-products, similarity measures, and distances between low- and high-dimensional vectors is ubiquitous in Machine Learning, Scientific Computing, Geospatial Analysis, and Information Retrieval.
These algorithms generally have linear complexity in time, constant or linear complexity in space, and are data-parallel.
In other words, they are easily parallelizable and vectorizable, and are often available in packages like BLAS (level 1) and LAPACK, as well as the higher-level numpy and scipy Python libraries.
Ironically, even with decades of evolution in compilers and numerical computing, most libraries can be 3-200x slower than hardware potential even on the most popular hardware, like 64-bit x86 and Arm CPUs.
Moreover, most lack mixed-precision support, which is crucial for modern AI!
The rare few that offer even minimal mixed-precision support run on only one platform and are vendor-locked by companies like Intel and Nvidia.
NumKong provides an alternative.
1️⃣ NumKong functions are practically as fast as memcpy.
2️⃣ Unlike BLAS, most kernels are designed for mixed-precision and bit-level operations.
3️⃣ NumKong often ships more binaries than NumPy and has more backends than most BLAS implementations, and more high-level interfaces than most libraries.
NumKong (Arabic: "سيمسيم دي") is a mixed-precision math library of over 350 SIMD-optimized kernels extensively used in AI, Search, and DBMS workloads. Named after the iconic "Open Sesame" command that opened doors to treasure in Ali Baba and the Forty Thieves, NumKong can help you 10x the cost-efficiency of your computational pipelines. Implemented distance functions include:
- Euclidean (L2) and Cosine (Angular) spatial distances for Vector Search. docs
- Dot-Products for real & complex vectors for DSP & Quantum computing. docs
- Hamming (~ Manhattan) and Jaccard (~ Tanimoto) bit-level distances. docs
- Set Intersections for Sparse Vectors and Text Analysis. docs
- Mahalanobis distance and Quadratic forms for Scientific Computing. docs
- Kullback-Leibler and Jensen–Shannon divergences for probability distributions. docs
- Fused-Multiply-Add (FMA) and Weighted Sums to replace BLAS level 1 functions. docs
- For Levenshtein, Needleman–Wunsch, and Smith-Waterman, check StringZilla.
- 🔜 Haversine and Vincenty's formulae for Geospatial Analysis.
Moreover, NumKong...
- handles `float64`, `float32`, `float16`, and `bfloat16` real & complex vectors.
- handles `int8` integral, `int4` sub-byte, and `b8` binary vectors.
- handles sparse `uint32` and `uint16` sets, and weighted sparse vectors.
- is a zero-dependency header-only C 99 library.
- has Python, Rust, JS, and Swift bindings.
- has Arm backends for NEON, Scalable Vector Extensions (SVE), and SVE2.
- has x86 backends for Haswell, Skylake, Ice Lake, Genoa, and Sapphire Rapids.
- with both compile-time and runtime CPU feature detection easily integrates anywhere!
Due to the high level of fragmentation of SIMD support across different x86 CPUs, NumKong generally uses the names of select Intel CPU generations for its backends. They, however, also work on AMD CPUs. Intel Haswell is compatible with AMD Zen 1/2/3, while AMD Genoa Zen 4 covers AVX-512 instructions added to Intel Skylake and Ice Lake. You can learn more about the technical implementation details in the following blog-posts:
- Uses Horner's method for polynomial approximations, beating GCC 12 by 119x.
- Uses Arm SVE and x86 AVX-512's masked loads to eliminate tail `for`-loops.
- Substitutes libc's `sqrt` with Newton Raphson iterations.
- Uses Galloping and SVE2 histograms to intersect sparse vectors.
- For Python: avoids slow PyBind11, SWIG, & `PyArg_ParseTuple` using faster calling convention.
- For JavaScript: uses typed arrays and NAPI for zero-copy calls.
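Horner's method, named in the first item of the list above, evaluates a polynomial with one multiply-add per coefficient. The snippet below is only a minimal Python sketch of the technique; the function name `horner` and the lowest-degree-first coefficient order are illustrative assumptions, not part of NumKong's API.

```python
def horner(coefficients, x):
    """Evaluate c[0] + c[1]*x + c[2]*x**2 + ... with one multiply-add per term.

    `coefficients` is ordered from the lowest degree to the highest
    (an illustrative convention chosen for this sketch).
    """
    result = 0.0
    for c in reversed(coefficients):
        result = result * x + c  # maps onto a single fused multiply-add in hardware
    return result


# 1 + 2*x + 3*x**2 at x = 2
value = horner([1.0, 2.0, 3.0], 2.0)
```

Each loop step depends only on the previous accumulator value, which is why the scheme maps well onto fused multiply-add instructions.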
| Benchmark | NumPy | C 99 | NumKong |
|---|---|---|---|
| cosine distances between 1536d vectors in int8 | 🚧 overflows | x86: 10,548,600 ops/s; arm: 11,379,300 ops/s | x86: 16,151,800 ops/s; arm: 13,524,000 ops/s |
| cosine distances between 1536d vectors in bfloat16 | 🚧 not supported | x86: 119,835 ops/s; arm: 403,909 ops/s | x86: 9,738,540 ops/s; arm: 4,881,900 ops/s |
| cosine distances between 1536d vectors in float16 | x86: 40,481 ops/s; arm: 21,451 ops/s | x86: 501,310 ops/s; arm: 871,963 ops/s | x86: 7,627,600 ops/s; arm: 3,316,810 ops/s |
| cosine distances between 1536d vectors in float32 | x86: 253,902 ops/s; arm: 46,394 ops/s | x86: 882,484 ops/s; arm: 399,661 ops/s | x86: 8,202,910 ops/s; arm: 3,400,620 ops/s |
| cosine distances between 1536d vectors in float64 | x86: 212,421 ops/s; arm: 52,904 ops/s | x86: 839,301 ops/s; arm: 837,126 ops/s | x86: 1,538,530 ops/s; arm: 1,678,920 ops/s |
| euclidean distance between 1536d vectors in int8 | x86: 252,113 ops/s; arm: 177,443 ops/s | x86: 6,690,110 ops/s; arm: 4,114,160 ops/s | x86: 18,989,000 ops/s; arm: 18,878,200 ops/s |
| euclidean distance between 1536d vectors in bfloat16 | 🚧 not supported | x86: 119,842 ops/s; arm: 1,049,230 ops/s | x86: 9,727,210 ops/s; arm: 4,233,420 ops/s |
| euclidean distance between 1536d vectors in float16 | x86: 54,621 ops/s; arm: 71,793 ops/s | x86: 196,413 ops/s; arm: 911,370 ops/s | x86: 19,466,800 ops/s; arm: 3,522,760 ops/s |
| euclidean distance between 1536d vectors in float32 | x86: 424,944 ops/s; arm: 292,629 ops/s | x86: 1,295,210 ops/s; arm: 1,055,940 ops/s | x86: 8,924,100 ops/s; arm: 3,602,650 ops/s |
| euclidean distance between 1536d vectors in float64 | x86: 334,929 ops/s; arm: 237,505 ops/s | x86: 1,215,190 ops/s; arm: 905,782 ops/s | x86: 1,701,740 ops/s; arm: 1,735,840 ops/s |
For benchmarks we mostly use 1536-dimensional vectors, like the embeddings produced by the OpenAI Ada API. The code was compiled with GCC 12, using glibc v2.35. The benchmarks were performed on Arm-based Graviton3 AWS `c7g` instances and `r7iz` Intel Sapphire Rapids. Most modern Arm-based 64-bit CPUs will have similar relative speedups. Variance within x86 CPUs will be larger.
Similar speedups are often observed even when compared to BLAS and LAPACK libraries underlying most numerical computing libraries, including NumPy and SciPy in Python. Broader benchmarking results:
In general there are a few principles that NumKong follows:
- Avoid loop unrolling.
- Never allocate memory.
- Never throw exceptions or set `errno`.
- Keep all function arguments the size of the pointer.
- Avoid returning from public interfaces, use out-arguments instead.
- Don't over-optimize for old CPUs and single- and double-precision floating-point numbers.
- Prioritize mixed-precision and integer operations, and new ISA extensions.
- Prefer saturated arithmetic and avoid overflows.
Possibly, in the future:
- Best effort computation silencing `NaN` components in low-precision inputs.
- Detect overflows and report the distance with a "signaling" `NaN`.
Last but not least: don't build unless there is demand for it. So if you have a specific use-case, please open an issue or a pull request, and ideally, bring in more users with similar needs.
The cosine similarity is the most common and straightforward metric used in machine learning and information retrieval. Interestingly, there are multiple ways to shoot yourself in the foot when computing it. The cosine similarity is the cosine of the angle between two vectors, and the cosine distance is its complement: one minus the cosine similarity.
In NumPy terms, the NumKong implementation is similar to:

```python
import numpy as np

def cos_numpy(a: np.ndarray, b: np.ndarray) -> float:
    ab, a2, b2 = np.dot(a, b), np.dot(a, a), np.dot(b, b) # Fused in NumKong
    if a2 == 0 and b2 == 0: result = 0 # Same in SciPy
    elif ab == 0: result = 1 # Division by zero error in SciPy
    else: result = 1 - ab / (np.sqrt(a2) * np.sqrt(b2)) # Bigger rounding error in SciPy
    return result
```

In SciPy, however, the cosine distance is computed as `1 - ab / np.sqrt(a2 * b2)`.
It handles the edge case of a zero and non-zero argument pair differently, resulting in a division by zero error.
It's not only less efficient, but also less accurate, given how the reciprocal square roots are computed.
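One way the two formulations can diverge is overflow: the product `a2 * b2` may exceed the floating-point range even when each square root is perfectly representable. The pure-Python sketch below illustrates this with double-precision floats. Both functions are illustrative stand-ins written for this example, not the actual SciPy or NumKong code.

```python
import math


def cos_separate_roots(a, b):
    """Cosine distance as 1 - ab / (sqrt(a2) * sqrt(b2)), with zero-norm handling."""
    ab = sum(x * y for x, y in zip(a, b))
    a2 = sum(x * x for x in a)
    b2 = sum(y * y for y in b)
    if a2 == 0 and b2 == 0:
        return 0.0
    if ab == 0:
        return 1.0
    return 1 - ab / (math.sqrt(a2) * math.sqrt(b2))


def cos_joint_root(a, b):
    """Cosine distance as 1 - ab / sqrt(a2 * b2), without zero-norm handling."""
    ab = sum(x * y for x, y in zip(a, b))
    a2 = sum(x * x for x in a)
    b2 = sum(y * y for y in b)
    return 1 - ab / math.sqrt(a2 * b2)  # a2 * b2 can overflow to infinity


big = [1e100] * 4                         # identical vectors: true distance is 0
separate = cos_separate_roots(big, big)   # stays close to 0
joint = cos_joint_root(big, big)          # a2 * b2 overflows, result becomes 1
```

The same effect appears much earlier with narrow types such as `float16`, where the representable range is far smaller.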
The C standard library provides the sqrt function, which is generally very accurate, but slow.
The rsqrt in-hardware implementations are faster, but have different accuracy characteristics.
- SSE `rsqrtps` and AVX `vrsqrtps`: $1.5 \times 2^{-12}$ maximal relative error.
- AVX-512 `vrsqrt14pd` instruction: $2^{-14}$ maximal relative error.
- NEON `frsqrte` instruction has no documented error bounds, but can be $2^{-3}$.
To overcome the limitations of the rsqrt instruction, NumKong uses the Newton-Raphson iteration to refine the initial estimate for high-precision floating-point numbers.
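It can be defined as:

$$
x_{n+1} = x_n \cdot \frac{3 - d \cdot x_n^2}{2}
$$

where $x_n$ is the current estimate of $1 / \sqrt{d}$.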
On 1536-dimensional inputs on an Intel Sapphire Rapids CPU, a single such iteration can reduce the relative error by 2-3 orders of magnitude:
| Datatype | NumPy Error | NumKong w/out Iteration | NumKong |
|---|---|---|---|
| `bfloat16` | 1.89e-08 ± 1.59e-08 | 3.07e-07 ± 3.09e-07 | 3.53e-09 ± 2.70e-09 |
| `float16` | 1.67e-02 ± 1.44e-02 | 2.68e-05 ± 1.95e-05 | 2.02e-05 ± 1.39e-05 |
| `float32` | 2.21e-08 ± 1.65e-08 | 3.47e-07 ± 3.49e-07 | 3.77e-09 ± 2.84e-09 |
| `float64` | 0.00e+00 ± 0.00e+00 | 3.80e-07 ± 4.50e-07 | 1.35e-11 ± 1.85e-11 |
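A single Newton-Raphson refinement step for a reciprocal square root can be sketched in a few lines of Python. The coarse hardware estimate is emulated here by perturbing the exact value by a relative error of $2^{-12}$; the function name `rsqrt_refine` and that perturbation are assumptions made for illustration, not NumKong internals.

```python
import math


def rsqrt_refine(d: float, estimate: float) -> float:
    """One Newton-Raphson step: x_next = x * (3 - d * x * x) / 2."""
    return estimate * (3.0 - d * estimate * estimate) / 2.0


d = 2.0
exact = 1.0 / math.sqrt(d)

# Emulate a coarse hardware estimate with a relative error of 2**-12.
coarse = exact * (1.0 + 2.0 ** -12)
refined = rsqrt_refine(d, coarse)

coarse_error = abs(coarse - exact) / exact
refined_error = abs(refined - exact) / exact
```

Because the iteration converges quadratically, the refined error is roughly 1.5 times the square of the initial error, which is consistent with a drop of several orders of magnitude after one step.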
The Mahalanobis distance is a generalization of the Euclidean distance, which takes into account the covariance of the data. It's very similar in its form to the bilinear form, which is a generalization of the dot product.
Bilinear Forms can be seen as one of the most important linear algebraic operations, surprisingly missing in BLAS and LAPACK. They are versatile and appear in various domains:
- In Quantum Mechanics, the expectation value of an observable $A$ in a state $\psi$ is given by $\langle \psi | A | \psi \rangle$, which is a bilinear form.
- In Machine Learning, in Support Vector Machines (SVMs), bilinear forms define kernel functions that measure similarity between data points.
- In Differential Geometry, the metric tensor, which defines distances and angles on a manifold, is a bilinear form on the tangent space.
- In Economics, payoff functions in certain Game Theoretic problems can be modeled as bilinear forms of players' strategies.
- In Physics, interactions between electric and magnetic fields can be expressed using bilinear forms.
Broad applications aside, the lack of a specialized primitive for bilinear forms in BLAS and LAPACK means significant performance overhead.
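A vector-matrix-vector product such as $a^T M b$ is a scalar, whereas its intermediate parts, $a^T M$ or $M b$, are vectors:

- They need memory to be stored in: $O(n)$ allocation.
- The data will be written to memory and read back, wasting CPU cycles.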
NumKong doesn't produce intermediate vector results, like a @ M @ b, but computes the bilinear form directly.
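The idea of accumulating a bilinear form directly into a scalar can be sketched in plain Python. The names `bilinear` and `mahalanobis` and the list-of-rows matrix layout are illustrative assumptions for this example, not NumKong's interface; the Mahalanobis helper uses the standard definition with an inverse covariance matrix.

```python
import math


def bilinear(a, M, b):
    """Compute a^T M b with scalar accumulators only, never storing M @ b."""
    total = 0.0
    for a_i, row in zip(a, M):
        row_dot = 0.0
        for m_ij, b_j in zip(row, b):
            row_dot += m_ij * b_j      # dot product of one matrix row with b
        total += a_i * row_dot          # fold it straight into the result
    return total


def mahalanobis(a, b, inverse_covariance):
    """sqrt((a - b)^T C^-1 (a - b)), built on the same bilinear form."""
    diff = [x - y for x, y in zip(a, b)]
    return math.sqrt(bilinear(diff, inverse_covariance, diff))


form = bilinear([1.0, 2.0], [[1.0, 2.0], [3.0, 4.0]], [5.0, 6.0])

# With an identity inverse covariance, Mahalanobis reduces to Euclidean distance.
distance = mahalanobis([0.0, 0.0], [3.0, 4.0], [[1.0, 0.0], [0.0, 1.0]])
```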
The set intersection operation is generally defined as the number of elements that are common between two sets, represented as sorted arrays of integers. The most common way to compute it is a linear scan:
```c
#include <stddef.h> // size_t

size_t intersection_size(int *a, int *b, size_t n, size_t m) {
    size_t i = 0, j = 0, count = 0;
    while (i < n && j < m) {
        if (a[i] < b[j]) i++;
        else if (a[i] > b[j]) j++;
        else i++, j++, count++;
    }
    return count;
}
```

Alternatively, one can use binary search to find the elements of the second array that are present in the first one. On every step the checked region of the second array is halved; combined with exponentially growing probe steps, this is called galloping search. It's faster, but only when large arrays of very different sizes are intersected. A third approach is to use SIMD instructions to compare multiple elements at once:
- Using string-intersection instructions on x86, like `pcmpestrm`.
- Using integer-intersection instructions in AVX-512, like `vp2intersectd`.
- Using vanilla equality checks present in all SIMD instruction sets.
After benchmarking, the last approach was chosen, as it's the most flexible and often the fastest.
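For comparison with the linear scan, the galloping alternative can be sketched in Python using the standard `bisect` module. This is a scalar illustration of the general technique under the assumption of sorted, duplicate-free inputs; the function name is hypothetical and the code is not how NumKong's SIMD kernels are written.

```python
from bisect import bisect_left


def gallop_intersection_size(short, long):
    """Count common elements of two sorted, duplicate-free lists."""
    count = 0
    lo = 0  # everything in long[:lo] is smaller than the current element
    for x in short:
        # Gallop: double the step until we overshoot x or run out of data.
        step = 1
        while lo + step < len(long) and long[lo + step] < x:
            step *= 2
        hi = min(lo + step + 1, len(long))
        # Binary search inside the bracketed window only.
        lo = bisect_left(long, x, lo, hi)
        if lo == len(long):
            break  # the longer list is exhausted
        if long[lo] == x:
            count += 1
            lo += 1
    return count


evens = list(range(0, 1000, 2))
common = gallop_intersection_size([3, 50, 999], evens)  # only 50 is shared
```

The cost is roughly proportional to the shorter length times the logarithm of the gaps skipped in the longer list, which is why it pays off only when the sizes differ greatly.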
Complex dot products are a generalization of the dot product to complex numbers.
They are supported by most BLAS packages, but almost never in mixed precision.
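NumKong defines `dot` and `vdot` kernels as:

$$
\text{dot}(a, b) = \sum_{i=0}^{n-1} a_i \cdot b_i
$$

$$
\text{vdot}(a, b) = \sum_{i=0}^{n-1} \bar{a_i} \cdot b_i
$$

Where $\bar{a_i}$ is the complex conjugate of $a_i$. In Python terms, for vectors stored as interleaved real and imaginary parts: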
```python
from typing import List, Tuple

def dot(a: List[float], b: List[float]) -> Tuple[float, float]:
    # Inputs interleave real and imaginary parts: [re0, im0, re1, im1, ...]
    a_real, a_imaginary = a[0::2], a[1::2]
    b_real, b_imaginary = b[0::2], b[1::2]
    ab_real, ab_imaginary = 0, 0
    for ar, ai, br, bi in zip(a_real, a_imaginary, b_real, b_imaginary):
        ab_real += ar * br - ai * bi
        ab_imaginary += ar * bi + ai * br
    return ab_real, ab_imaginary

def vdot(a: List[float], b: List[float]) -> Tuple[float, float]:
    # Same layout as `dot`, but the first argument is conjugated
    a_real, a_imaginary = a[0::2], a[1::2]
    b_real, b_imaginary = b[0::2], b[1::2]
    ab_real, ab_imaginary = 0, 0
    for ar, ai, br, bi in zip(a_real, a_imaginary, b_real, b_imaginary):
        ab_real += ar * br + ai * bi
        ab_imaginary += ar * bi - ai * br
    return ab_real, ab_imaginary
```

The Kullback-Leibler divergence is a measure of how one probability distribution diverges from a second, expected probability distribution. Jensen-Shannon divergence is a symmetrized and smoothed version of the Kullback-Leibler divergence, which can be used as a distance metric between probability distributions.
Both functions are defined for non-negative numbers, and the logarithm is a key part of their computation.
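Both divergences can be written down directly from their textbook definitions. The sketch below is plain Python using natural logarithms. It assumes the inputs are already normalized and that `q` is positive wherever `p` is, and it makes no claim about NumKong's exact conventions, such as the logarithm base or whether a square root is applied to the Jensen-Shannon result.

```python
import math


def kl_divergence(p, q):
    """Sum of p_i * log(p_i / q_i); terms with p_i == 0 contribute nothing."""
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)


def js_divergence(p, q):
    """Average KL divergence of p and q from their midpoint distribution."""
    m = [(p_i + q_i) / 2 for p_i, q_i in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


p = [0.5, 0.3, 0.2]
q = [0.1, 0.6, 0.3]

kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)   # generally differs from kl_pq: KL is asymmetric
js_pq = js_divergence(p, q)   # symmetric and bounded by log(2)
```

The midpoint distribution is positive wherever either input is, so the Jensen-Shannon variant stays finite even for distributions with disjoint support.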
The Fused-Multiply-Add (FMA) operation is a single operation that combines element-wise multiplication and addition with different scaling factors. The Weighted Sum is its simplified variant without element-wise multiplication.
In NumPy terms, the implementation may look like:
```python
import numpy as np

def wsum(A: np.ndarray, B: np.ndarray, /, Alpha: float, Beta: float) -> np.ndarray:
    assert A.dtype == B.dtype, "Input types must match and affect the output style"
    return (Alpha * A + Beta * B).astype(A.dtype)

def fma(A: np.ndarray, B: np.ndarray, C: np.ndarray, /, Alpha: float, Beta: float) -> np.ndarray:
    assert A.dtype == B.dtype and A.dtype == C.dtype, "Input types must match and affect the output style"
    return (Alpha * A * B + Beta * C).astype(A.dtype)
```

The tricky part is implementing those operations in mixed precision, where the scaling factors are of different precision than the input and output vectors.
NumKong uses double-precision floating-point scaling factors for any input and output precision, including i8 and u8 integers and f16 and bf16 floats.
Depending on the generation of the CPU, given native support for f16 addition and multiplication, the f16 temporaries are used for i8 and u8 multiplication, scaling, and addition.
For bf16, native support is generally limited to dot-products with subsequent partial accumulation, which is not enough for the FMA and WSum operations, so f32 is used as a temporary.
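To make the mixed-precision idea concrete, here is a plain-Python sketch of a weighted sum over `i8` values with double-precision scaling factors. The rounding mode (Python's `round`, which rounds halves to even) and the clamping to the `int8` range are assumptions made for illustration; the function name `wsum_i8` is hypothetical and the wider temporaries NumKong actually picks depend on the CPU, as described above.

```python
I8_MIN, I8_MAX = -128, 127


def wsum_i8(a, b, alpha: float, beta: float):
    """Weighted sum of int8 sequences, scaled in float64, saturated on output."""
    out = []
    for x, y in zip(a, b):
        value = alpha * x + beta * y          # Python floats are double-precision
        rounded = round(value)                 # assumed rounding mode for this sketch
        out.append(max(I8_MIN, min(I8_MAX, rounded)))  # saturate instead of wrapping
    return out


# 100 + 100 would wrap around in int8; saturation pins it to 127 instead.
saturated = wsum_i8([100, -100, 10], [100, -100, 20], 1.0, 1.0)
```

Saturating keeps out-of-range results at the nearest representable value, whereas a plain narrowing cast would wrap around and flip the sign.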
On the Intel Sapphire Rapids platform, NumKong was benchmarked against auto-vectorized code using GCC 12.
GCC handles single-precision float well, but might not be the best choice for int8 and _Float16 arrays, which have been part of the C language since 2011.
| Kind | GCC 12 `f32` | GCC 12 `f16` | NumKong `f16` | `f16` improvement |
|---|---|---|---|---|
| Inner Product | 3,810 K/s | 192 K/s | 5,990 K/s | 31 x |
| Cosine Distance | 3,280 K/s | 336 K/s | 6,880 K/s | 20 x |
| Euclidean Distance ² | 4,620 K/s | 147 K/s | 5,320 K/s | 36 x |
| Jensen-Shannon Divergence | 1,180 K/s | 18 K/s | 2,140 K/s | 118 x |
Most popular software is precompiled and distributed with fairly conservative CPU optimizations, to ensure compatibility with older hardware. Database Management platforms, like ClickHouse, and Web Browsers, like Google Chrome, need to run on billions of devices, and they can't afford to be picky about the CPU features. For such users NumKong provides a dynamic dispatch mechanism, which selects the most advanced micro-kernel for the current CPU at runtime.
| Subset | F | CD | ER | PF | 4FMAPS | 4VNNIW | VPOPCNTDQ | VL | DQ | BW | IFMA | VBMI | VNNI | BF16 | VBMI2 | BITALG | VPCLMULQDQ | GFNI | VAES | VP2INTERSECT | FP16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Knights Landing (Xeon Phi x200, 2016) | Yes | Yes | No | ||||||||||||||||||
| Knights Mill (Xeon Phi x205, 2017) | Yes | No | |||||||||||||||||||
| Skylake-SP, Skylake-X (2017) | No | No | Yes | No | |||||||||||||||||
| Cannon Lake (2018) | Yes | No | |||||||||||||||||||
| Cascade Lake (2019) | No | Yes | No | ||||||||||||||||||
| Cooper Lake (2020) | Yes | No | |||||||||||||||||||
| Ice Lake (2019) | Yes | No | Yes | No | |||||||||||||||||
| Tiger Lake (2020) | Yes | No | |||||||||||||||||||
| Rocket Lake (2021) | No | ||||||||||||||||||||
| Alder Lake (2021) | Partial | Partial | |||||||||||||||||||
| Zen 4 (2022) | Yes | Yes | No | ||||||||||||||||||
| Sapphire Rapids (2023) | No | Yes | |||||||||||||||||||
| Zen 5 (2024) | Yes | No | |||||||||||||||||||
You can compile NumKong on an old CPU, like Intel Haswell, and run it on a new one, like AMD Genoa, and it will automatically use the most advanced instructions available.
The reverse is also true: you can compile on a new CPU and run on an old one, and it will automatically fall back to the most basic instructions.
Moreover, the very first time you probe for CPU capabilities with nk_capabilities(), it initializes the dynamic dispatch mechanism, and all subsequent calls will be faster and won't face race conditions in multi-threaded environments.
NumKong exposes all kernels for all backends, and you can select the most advanced one for the current CPU without relying on built-in dispatch mechanisms.
That's handy for testing and benchmarking, but also in case you want to dispatch a very specific kernel for a very specific CPU, bypassing NumKong's assignment logic.
All of the function names follow the same pattern: nk_{function}_{type}_{backend}.
- The backend can be `serial`, `haswell`, `skylake`, `ice`, `genoa`, `sapphire`, `turin`, `neon`, or `sve`.
- The type can be `f64`, `f32`, `f16`, `bf16`, `f64c`, `f32c`, `f16c`, `bf16c`, `i8`, or `b8`.
- The function can be `dot`, `vdot`, `cos`, `l2sq`, `hamming`, `jaccard`, `kl`, `js`, or `intersect`.
To avoid hard-coding the backend, you can use the nk_kernel_punned_t to pun the function pointer and the nk_capabilities function to get the available backends at runtime.
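The naming pattern makes it straightforward to reason about which symbol a dispatcher would pick. The Python sketch below only composes kernel names from a set of available backends. The preference order `X86_ORDER`, the helper `kernel_name`, and the capability set passed to it are all hypothetical and do not reflect NumKong's actual selection logic or API.

```python
# Hypothetical preference order, most advanced first; chosen for illustration only.
X86_ORDER = ["sapphire", "genoa", "ice", "skylake", "haswell", "serial"]


def kernel_name(function: str, dtype: str, available: set, order=X86_ORDER) -> str:
    """Compose nk_{function}_{type}_{backend} for the first available backend."""
    for backend in order:
        if backend in available:
            return f"nk_{function}_{dtype}_{backend}"
    raise LookupError("no backend available for this CPU")


# A CPU that reports only Haswell-level features falls back accordingly.
name = kernel_name("cos", "f16", {"serial", "haswell"})
```

In C, the equivalent step would resolve to a function pointer rather than a string, but the lookup order works the same way.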
To match all the function names, consider a RegEx:
```
NK_PUBLIC void nk_\w+_\w+_\w+\(
```

On Linux, you can use the following command to list all unique functions:

```sh
$ grep -oP 'NK_PUBLIC void nk_\w+_\w+_\w+\(' include/numkong/*.h | sort | uniq
> include/numkong/binary.h:NK_PUBLIC void nk_hamming_b8_haswell(
> include/numkong/binary.h:NK_PUBLIC void nk_hamming_b8_ice(
> include/numkong/binary.h:NK_PUBLIC void nk_hamming_b8_neon(
> include/numkong/binary.h:NK_PUBLIC void nk_hamming_b8_serial(
> include/numkong/binary.h:NK_PUBLIC void nk_hamming_b8_sve(
> include/numkong/binary.h:NK_PUBLIC void nk_jaccard_b8_haswell(
> include/numkong/binary.h:NK_PUBLIC void nk_jaccard_b8_ice(
> include/numkong/binary.h:NK_PUBLIC void nk_jaccard_b8_neon(
> include/numkong/binary.h:NK_PUBLIC void nk_jaccard_b8_serial(
> include/numkong/binary.h:NK_PUBLIC void nk_jaccard_b8_sve(
```

Feel free to use the project under Apache 2.0 or the Three-clause BSD license at your preference.
