Vector math benchmarks for Elixir libs

# Elixir Mathbench

Crude benchmarks for various Elixir vector math libs. There's a fair amount written about crunching big matrices with neural-net algorithms using Nx, but the more I look at that the clearer it becomes that it is definitely not what I want. I want to do physics and graphics, which means crunching lots and lots of relatively small vectors and matrices that can change rapidly, not offloading giant chonks of numbers to the GPU and waiting for it to get back to you days later. So I am curious about the speed of doing this gamedev-type numerical work on the Erlang VM, how it's affected by different implementation choices, and how the progress of the BEAM's JIT has changed it. Erlang is not really designed to be good at numerical work; it's designed to be good at putting together and taking apart pointer-heavy data structures, but it's also traditionally kinda faster in general than I expect it to be. I did some very crude benchmarks and was very surprised by the results, so here I'm trying to figure out more about what's going on.

There's also a variety of implementation approaches taken by the various vector math libs. Some of them use plain BEAM data types, some wrap BLAS primitives in NIFs and write the higher-level stuff atop that in Elixir, some wrap higher-level C or Rust code in NIFs, etc. So I want to compare these approaches a little.
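
To give a feel for how different those representations are, here's roughly what the same vec3 addition looks like in three of the libs (a sketch based on the libs' documented APIs; exact signatures may differ between versions, and numerl is omitted):

```elixir
# Graphmath: a vec3 is just a tuple of floats, and the math runs on the BEAM.
{5.0, 7.0, 9.0} = Graphmath.Vec3.add({1.0, 2.0, 3.0}, {4.0, 5.0, 6.0})

# Matrex: the data lives in a binary and the arithmetic happens in a C NIF.
a = Matrex.new([[1.0, 2.0, 3.0]])
b = Matrex.new([[4.0, 5.0, 6.0]])
Matrex.add(a, b)

# Nx: tensors are also binaries; the default backend computes in pure Elixir,
# while the EXLA backend hands the work off to XLA.
Nx.add(Nx.tensor([1.0, 2.0, 3.0]), Nx.tensor([4.0, 5.0, 6.0]))
```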

Disclaimer: These are not good benchmarks. This is a complicated topic with lots of fiddly nuance, and I don't have the patience to go through it in all the detail it would actually take. For example, there are known optimizer bugs in some recent versions of Elixir. I'm just curious about a ballpark view of the landscape. I'm also not going to look at GPU stuff; it adds too many variables.

The test harness used is Benchee.
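
The actual suites live under lib/, but the shape of a Benchee run is roughly this (a hypothetical sketch, not the repo's exact code; inputs are built once up front so only the operation itself is measured):

```elixir
a = {1.0, 2.0, 3.0}
b = {4.0, 5.0, 6.0}
na = Nx.tensor([1.0, 2.0, 3.0])
nb = Nx.tensor([4.0, 5.0, 6.0])

Benchee.run(
  %{
    "graphmath_vec3_add" => fn -> Graphmath.Vec3.add(a, b) end,
    "nx_vec3_add" => fn -> Nx.add(na, nb) end
  },
  warmup: 2,
  time: 5,
  memory_time: 1
)
```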

Building and running:

```sh
# Install math deps
sudo apt install libopenblas-dev liblapacke-dev

# Setup ASDF
git clone https://github.com/asdf-vm/asdf.git ~/.asdf --branch v0.14.0
source ~/.asdf/asdf.fish

asdf plugin add erlang
asdf plugin add elixir
asdf plugin add rebar

asdf install erlang 26.0.2
asdf install elixir 1.15.4-otp-26
asdf install rebar 3.23.0

asdf global rebar 3.23.0
asdf global erlang 26.0.2
asdf global elixir 1.15.4-otp-26

# Actually build the stuff
mix deps.get
iex -S mix
```

Then, in the iex shell:

```elixir
ElixirMathbench.run_all()
```

# Results

| Lib         | Representation | Computation engine |
|-------------|----------------|--------------------|
| Graphmath   | BEAM tuples    | BEAM               |
| Matrex      | BEAM binaries  | BLAS               |
| numerl      | BEAM binaries  | BLAS               |
| Nx (native) | BEAM binaries  | BEAM               |
| Nx (EXLA)   | ?              | XLA/CPU            |
 

| Lib         | Performance | Completeness | Ergonomics |
|-------------|-------------|--------------|------------|
| Graphmath   | B+          | A-           | B          |
| Matrex      | B           | C+           | A-         |
| numerl      | C           | B            | B+         |
| Nx (native) | F           | C-           | A-         |
| Nx (EXLA)   | F           | C-           | A-         |

The real result of this is the realization that Graphmath is the only lib among them that actually has all the operations relevant for graphics and physics: Cross product, quaternion rotations, stuff like that. And that Benchee is pretty damn slick.

So, if you are doing lots of math to bunches of small vectors, it appears you're almost always best off getting out of the way of the JIT and letting it optimize your math for you. There are a few cases where the native-code libs do significantly better, namely matrix multiplies, so those might be worth investigating further and trying to optimize, but Graphmath's "tuples full of numbers" representation works absurdly well in most cases compared to the overhead necessary for FFI calls into native functions. ...That said, some very rough and stupid playing with Rust and the glam crate looked like it did vec4*mat4 multiplications about 200x faster than Graphmath, which is also pretty impressive.
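
For a sense of why the tuple representation holds up so well: a Graphmath-style operation is just pattern matching and float arithmetic on a small fixed-size tuple, which the JIT turns into tight native code with no FFI boundary to cross. Something along these lines (a sketch in the same style, not Graphmath's actual source):

```elixir
defmodule TupleVec3 do
  # Tuple-based vec3 math in the style of Graphmath: destructure the tuples,
  # do the float arithmetic, build a new tuple. No copies into NIF-land.
  def add({x1, y1, z1}, {x2, y2, z2}), do: {x1 + x2, y1 + y2, z1 + z2}

  def cross({x1, y1, z1}, {x2, y2, z2}) do
    {y1 * z2 - z1 * y2, z1 * x2 - x1 * z2, x1 * y2 - y1 * x2}
  end
end
```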

Possible future work: Big matrices, small matrices packed into big ones, run with and without JIT, test more libs, test against other langs (numpy and rust mainly)

# Possible confounding factors

Every benchmark is the given operation applied to a list of 1000 values with Enum.map(). That has some overhead, but I tried benchmarking these libs on single operations (lib/elixir_mathbench_single.ex) and got pretty similar results. But in a real system there are probably gains to be had by using a denser data structure of some kind or another, let alone things like column-major layouts.
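
Concretely, each measured function is roughly this shape (a sketch with made-up values; the input list is built once, outside the measured closure):

```elixir
# Build the 1000 input vectors once, outside the measurement...
offset = {1.0, 2.0, 3.0}
vecs = for _ <- 1..1_000, do: {:rand.uniform(), :rand.uniform(), :rand.uniform()}

# ...and benchmark only the map over them.
measured = fn -> Enum.map(vecs, &Graphmath.Vec3.add(&1, offset)) end
```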

I have not checked whether any of these libraries actually gives correct results. I guess I trust them to get their basics right.

Creating/destroying matrices is not benchmarked; all the creation of inputs is done outside the benchmark measurement functions. However, the libs generally do have to create a new matrix for their outputs most of the time, which might be part of why things like EXLA are so slow: if something is designed to operate on the GPU, it's going to assume that allocating memory is always slow, and probably won't try too hard to make that path super-optimal.

# Detailed results

# System

Benchmark suite executing on the following system:

- Operating System: Linux
- CPU Information: AMD Ryzen 5 7640U w
- Number of Available Cores: 12
- Available Memory: 14.94 GB
- Elixir Version: 1.15.4
- Erlang Version: 26.0.2

# Configuration

Benchmark suite executing with the following configuration:

- `:time`: 5 s
- `:parallel`: 1
- `:warmup`: 2 s

# Benchmark vec3_add

Run Time

| Name | IPS | Average | Deviation | Median | 99th % |
|------|-----|---------|-----------|--------|--------|
| graphmath_vec3_add | 26.31 K | 38.00 µs | ±51.24% | 28.41 µs | 97.75 µs |
| matrex_vec3_add | 13.64 K | 73.34 µs | ±21.57% | 72.37 µs | 107.58 µs |
| numerl_vec3_add | 6.40 K | 156.25 µs | ±23.12% | 175.50 µs | 211.31 µs |
| nx_vec3_add | 1.26 K | 793.05 µs | ±22.67% | 754.82 µs | 1854.12 µs |
| exla_vec3_add | 0.0714 K | 14014.31 µs | ±10.38% | 13743.83 µs | 22279.93 µs |

Run Time Comparison

| Name | IPS | Slower |
|------|-----|--------|
| graphmath_vec3_add | 26.31 K | |
| matrex_vec3_add | 13.64 K | 1.93x |
| numerl_vec3_add | 6.40 K | 4.11x |
| nx_vec3_add | 1.26 K | 20.87x |
| exla_vec3_add | 0.0714 K | 368.77x |

Memory Usage

| Name | Average | Factor |
|------|---------|--------|
| graphmath_vec3_add | 46.92 KB | |
| matrex_vec3_add | 172.01 KB | 3.67x |
| numerl_vec3_add | 140.52 KB | 2.99x |
| nx_vec3_add | 1446.76 KB | 30.83x |
| exla_vec3_add | 6618.74 KB | 141.06x |

# Benchmark vec3_cross

Run Time

| Name | IPS | Average | Deviation | Median | 99th % |
|------|-----|---------|-----------|--------|--------|
| graphmath_vec3_cross | 13.98 K | 71.54 µs | ±17.46% | 69.61 µs | 91.90 µs |
| matrex_vec3_cross | 10.76 K | 92.96 µs | ±25.88% | 91.08 µs | 177.99 µs |
| numerl_vec3_cross | 2.97 K | 337.20 µs | ±20.76% | 324.80 µs | 533.70 µs |

Run Time Comparison

| Name | IPS | Slower |
|------|-----|--------|
| graphmath_vec3_cross | 13.98 K | |
| matrex_vec3_cross | 10.76 K | 1.3x |
| numerl_vec3_cross | 2.97 K | 4.71x |

Memory Usage

| Name | Average | Factor |
|------|---------|--------|
| graphmath_vec3_cross | 46.92 KB | |
| matrex_vec3_cross | 250.21 KB | 5.33x |
| numerl_vec3_cross | 265.60 KB | 5.66x |

# Benchmark vec3_dot

Run Time

| Name | IPS | Average | Deviation | Median | 99th % |
|------|-----|---------|-----------|--------|--------|
| graphmath_vec3_dot | 23.48 K | 42.59 µs | ±20.84% | 40.84 µs | 67.27 µs |
| matrex_vec3_dot | 9.29 K | 107.59 µs | ±49.69% | 97.70 µs | 273.42 µs |
| numerl_vec3_dot | 3.64 K | 274.99 µs | ±25.39% | 264.57 µs | 624.24 µs |
| nx_vec3_dot | 0.189 K | 5289.58 µs | ±14.25% | 4900.31 µs | 8043.57 µs |
| exla_vec3_dot | 0.0567 K | 17636.18 µs | ±15.06% | 17257.14 µs | 25854.61 µs |

Run Time Comparison

| Name | IPS | Slower |
|------|-----|--------|
| graphmath_vec3_dot | 23.48 K | |
| matrex_vec3_dot | 9.29 K | 2.53x |
| numerl_vec3_dot | 3.64 K | 6.46x |
| nx_vec3_dot | 0.189 K | 124.19x |
| exla_vec3_dot | 0.0567 K | 414.08x |

Memory Usage

| Name | Average | Factor |
|------|---------|--------|
| graphmath_vec3_dot | 15.64 KB | |
| matrex_vec3_dot | 351.88 KB | 22.5x |
| numerl_vec3_dot | 249.79 KB | 15.97x |
| nx_vec3_dot | 11308.17 KB | 723.0x |
| exla_vec3_dot | 7379.56 KB | 471.82x |

# Benchmark quat_rotations

Run Time

| Name | IPS | Average | Deviation | Median | 99th % |
|------|-----|---------|-----------|--------|--------|
| graphmath_vec3_rotate | 4.22 K | 0.24 ms | ±25.27% | 0.21 ms | 0.42 ms |
| matrex_vec3_rotate | 0.89 K | 1.12 ms | ±37.38% | 1.02 ms | 3.25 ms |

Run Time Comparison

| Name | IPS | Slower |
|------|-----|--------|
| graphmath_vec3_rotate | 4.22 K | |
| matrex_vec3_rotate | 0.89 K | 4.75x |

Memory Usage

| Name | Average | Factor |
|------|---------|--------|
| graphmath_vec3_rotate | 0.21 MB | |
| matrex_vec3_rotate | 1.90 MB | 8.89x |

# Benchmark mat4_add

Run Time

| Name | IPS | Average | Deviation | Median | 99th % |
|------|-----|---------|-----------|--------|--------|
| matrex_mat4_add | 7.96 K | 125.60 µs | ±21.26% | 123.40 µs | 145.80 µs |
| graphmath_mat4_add | 7.56 K | 132.20 µs | ±25.39% | 119.69 µs | 206.39 µs |
| numerl_mat4_add | 6.05 K | 165.20 µs | ±27.22% | 178.69 µs | 223.69 µs |
| nx_mat4_add | 0.37 K | 2711.80 µs | ±7.42% | 2657.09 µs | 3423.86 µs |
| exla_mat4_add | 0.0679 K | 14725.47 µs | ±6.70% | 14674.08 µs | 17505.55 µs |

Run Time Comparison

| Name | IPS | Slower |
|------|-----|--------|
| matrex_mat4_add | 7.96 K | |
| graphmath_mat4_add | 7.56 K | 1.05x |
| numerl_mat4_add | 6.05 K | 1.32x |
| nx_mat4_add | 0.37 K | 21.59x |
| exla_mat4_add | 0.0679 K | 117.24x |

Memory Usage

| Name | Average | Factor |
|------|---------|--------|
| matrex_mat4_add | 179.82 KB | |
| graphmath_mat4_add | 148.45 KB | 0.83x |
| numerl_mat4_add | 101.45 KB | 0.56x |
| nx_mat4_add | 5012.82 KB | 27.88x |
| exla_mat4_add | 6656.80 KB | 37.02x |

# Benchmark mat4_mul

Run Time

| Name | IPS | Average | Deviation | Median | 99th % |
|------|-----|---------|-----------|--------|--------|
| matrex_mat4_mul | 9725.33 | 0.103 ms | ±16.67% | 0.101 ms | 0.156 ms |
| numerl_mat4_mul | 5989.04 | 0.167 ms | ±20.75% | 0.174 ms | 0.22 ms |
| graphmath_mat4_mul | 976.11 | 1.02 ms | ±20.31% | 1.03 ms | 1.48 ms |
| nx_mat4_mul | 352.77 | 2.83 ms | ±5.60% | 2.85 ms | 3.25 ms |
| exla_mat4_mul | 67.07 | 14.91 ms | ±7.39% | 14.69 ms | 18.76 ms |

Run Time Comparison

| Name | IPS | Slower |
|------|-----|--------|
| matrex_mat4_mul | 9725.33 | |
| numerl_mat4_mul | 5989.04 | 1.62x |
| graphmath_mat4_mul | 976.11 | 9.96x |
| nx_mat4_mul | 352.77 | 27.57x |
| exla_mat4_mul | 67.07 | 144.99x |

Memory Usage

| Name | Average | Factor |
|------|---------|--------|
| matrex_mat4_mul | 101.62 KB | |
| numerl_mat4_mul | 101.45 KB | 1.0x |
| graphmath_mat4_mul | 148.45 KB | 1.46x |
| nx_mat4_mul | 5012.82 KB | 49.33x |
| exla_mat4_mul | 6656.80 KB | 65.51x |

# Benchmark vec4_mat4_mul

Run Time

| Name | IPS | Average | Deviation | Median | 99th % |
|------|-----|---------|-----------|--------|--------|
| matrex_vm4_mul | 12.84 K | 77.90 µs | ±20.77% | 77.67 µs | 104.72 µs |
| graphmath_vm4_mul | 9.11 K | 109.80 µs | ±17.87% | 106.87 µs | 138.93 µs |
| numerl_vm4_mul | 5.67 K | 176.40 µs | ±32.50% | 185.72 µs | 266.17 µs |
| nx_vm4_mul | 0.23 K | 4416.87 µs | ±13.26% | 4215.49 µs | 7294.22 µs |
| exla_vm4_mul | 0.0553 K | 18083.56 µs | ±17.79% | 16789.01 µs | 26136.27 µs |

Run Time Comparison

| Name | IPS | Slower |
|------|-----|--------|
| matrex_vm4_mul | 12.84 K | |
| graphmath_vm4_mul | 9.11 K | 1.41x |
| numerl_vm4_mul | 5.67 K | 2.26x |
| nx_vm4_mul | 0.23 K | 56.7x |
| exla_vm4_mul | 0.0553 K | 232.15x |

Memory Usage

| Name | Average | Factor |
|------|---------|--------|
| matrex_vm4_mul | 172.01 KB | |
| graphmath_vm4_mul | 46.92 KB | 0.27x |
| numerl_vm4_mul | 148.33 KB | 0.86x |
| nx_vm4_mul | 13224.15 KB | 76.88x |
| exla_vm4_mul | 7434.05 KB | 43.22x |

# Benchmark mat4_transpose

Run Time

| Name | IPS | Average | Deviation | Median | 99th % |
|------|-----|---------|-----------|--------|--------|
| graphmath_mat4_transp | 45.82 K | 21.83 µs | ±74.32% | 12.70 µs | 53.05 µs |
| matrex_mat4_transp | 9.24 K | 108.18 µs | ±35.03% | 128.34 µs | 152.43 µs |
| numerl_mat4_transp | 8.02 K | 124.65 µs | ±31.31% | 141.40 µs | 169.01 µs |
| nx_mat4_transp | 0.53 K | 1883.57 µs | ±6.70% | 1818.93 µs | 2214.72 µs |
| exla_mat4_transp | 0.0662 K | 15105.50 µs | ±15.89% | 13847.66 µs | 20533.12 µs |

Run Time Comparison

| Name | IPS | Slower |
|------|-----|--------|
| graphmath_mat4_transp | 45.82 K | |
| matrex_mat4_transp | 9.24 K | 4.96x |
| numerl_mat4_transp | 8.02 K | 5.71x |
| nx_mat4_transp | 0.53 K | 86.3x |
| exla_mat4_transp | 0.0662 K | 692.09x |

Memory Usage

| Name | Average | Factor |
|------|---------|--------|
| graphmath_mat4_transp | 148.59 KB | |
| matrex_mat4_transp | 140.67 KB | 0.95x |
| numerl_mat4_transp | 101.45 KB | 0.68x |
| nx_mat4_transp | 3863.23 KB | 26.0x |
| exla_mat4_transp | 6155.34 KB | 41.43x |