Crude benchmarks for various Elixir vector math libs.

There's a fair amount written about crunching big matrices with neural-net algorithms using Nx, but the more I look at that world, the clearer it becomes that it's not what I want. I want to do physics and graphics, which means crunching lots and lots of relatively small vectors and matrices that change rapidly, not offloading giant chonks of numbers to the GPU and waiting for it to get back to you days later. So I'm curious how fast this gamedev-style numerical work is on the Erlang VM, how it's affected by different implementation choices, and how the progress of the BEAM's JIT has changed the picture. Erlang isn't really designed to be good at numerical work; it's designed to be good at putting together and taking apart pointer-heavy data structures. But it's also traditionally a bit faster in general than I expect it to be. I did some very crude benchmarks, was surprised by the results, and here I'm trying to figure out more about what's going on.
These libs also take a variety of implementation approaches: some use plain BEAM data types, some wrap BLAS primitives in NIFs and build the higher-level operations on top of that in Elixir, some wrap higher-level C or Rust code in NIFs, and so on. So I also want to compare those approaches a little.
Disclaimer: These are not good benchmarks. This is a complicated topic with lots of fiddly nuance, and I don't have the patience to go through it in all the detail it would actually take. For example, there are known optimizer bugs in some recent versions of Elixir. I'm just curious about a ballpark view of the landscape. I'm also not going to look at GPU stuff; it adds too many variables.
The test harness used is Benchee.
Building and running:

```sh
# Install math deps
sudo apt install libopenblas-dev liblapacke-dev

# Setup ASDF
git clone https://github.com/asdf-vm/asdf.git ~/.asdf --branch v0.14.0
source ~/.asdf/asdf.fish
asdf plugin add erlang
asdf plugin add elixir
asdf plugin add rebar
asdf install erlang 26.0.2
asdf install elixir 1.15.4-otp-26
asdf install rebar 3.23.0
asdf global rebar 3.23.0
asdf global erlang 26.0.2
asdf global elixir 1.15.4-otp-26

# Actually build the stuff
mix deps.get
iex -S mix

# In iex shell:
ElixirMathbench.run_all()
```
| Lib | Representation | Computation engine |
|-------------|----------------|--------------------|
| Graphmath | BEAM tuples | BEAM |
| Matrex | BEAM binaries | BLAS |
| numerl | BEAM binaries | BLAS |
| Nx (native) | BEAM binaries | BEAM |
| Nx (EXLA) | ? | XLA/CPU |
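For a rough sense of what those representations look like in practice, here's a small sketch; the constructor calls are my best understanding of each library's API, so check their docs before copying this.

```elixir
# Roughly what a small vector looks like in each representation.
graphmath_vec = {1.0, 2.0, 3.0}                 # plain BEAM tuple of floats
matrex_vec    = Matrex.new([[1.0, 2.0, 3.0]])   # binary-backed matrix (row vector)
nx_vec        = Nx.tensor([1.0, 2.0, 3.0])      # Nx tensor; BinaryBackend unless EXLA is configured
```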
| Lib | Performance | Completeness | Ergonomics |
|-------------|-------------|--------------|------------|
| Graphmath | B+ | A- | B |
| Matrex | B | C+ | A- |
| numerl | C | B | B+ |
| Nx (native) | F | C- | A- |
| Nx (EXLA) | F | C- | A- |
The real result of this is the realization that Graphmath is the only lib among them that actually has all the operations relevant for graphics and physics: cross products, quaternion rotations, that sort of thing. And that Benchee is pretty damn slick.
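For instance, the cross product is just a function over plain tuples (a tiny sketch; see Graphmath's docs for the full Vec3/Quatern API):

```elixir
# Cross product of the X and Y basis vectors using Graphmath's tuple-based vec3s.
x_axis = {1.0, 0.0, 0.0}
y_axis = {0.0, 1.0, 0.0}

Graphmath.Vec3.cross(x_axis, y_axis)
#=> {0.0, 0.0, 1.0}
```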
So, if you are doing lots of math to bunches of small vectors, it appears you're almost always best off getting out of the way of the JIT and letting it optimize your math for you. There are a few cases where the native-code libs do significantly better, namely matrix multiplies, so those might be worth investigating further and trying to optimize, but Graphmath's "tuples full of numbers" representation works absurdly well in most cases compared to the overhead of FFI calls to native functions. ...That said, some very rough and stupid playing with Rust and the glam crate looks like it does vec4*mat4 multiplications about 200x faster than Graphmath, which is also pretty impressive.
Possible future work: big matrices, small matrices packed into big ones, runs with and without the JIT, more libs, and comparisons against other languages (mainly numpy and Rust).
Every benchmark is the given operation applied to a list of 1000 values with Enum.map(). That has some overhead, but I tried benchmarking these libs on single operations (lib/elixir_mathbench_single.ex) and got pretty similar results. In a real system there are probably gains to be had by using a denser data structure of some kind or another, let alone things like column-major layouts.
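Concretely, each Benchee entry is shaped roughly like this. This is a minimal sketch of the setup described above, not the actual harness code; the input generation and the memory_time setting are made up for illustration.

```elixir
# Build 1000 pairs of vec3 tuples outside the measured function,
# then measure mapping one operation over the whole list.
pairs =
  for _ <- 1..1000 do
    {{:rand.uniform(), :rand.uniform(), :rand.uniform()},
     {:rand.uniform(), :rand.uniform(), :rand.uniform()}}
  end

Benchee.run(
  %{
    "graphmath_vec3_add" => fn ->
      Enum.map(pairs, fn {a, b} -> Graphmath.Vec3.add(a, b) end)
    end
  },
  warmup: 2,
  time: 5,
  memory_time: 1
)
```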
I have not checked whether any of these libraries actually gives correct results. I guess I trust them to get their basics right.
Creating/destroying matrices is not benchmarked; all the inputs are created outside the benchmark measurement functions. However, most operations still have to allocate a new matrix for their output, which might be part of why things like EXLA are so slow. Something designed to operate on the GPU is going to assume that allocating memory is always slow, and probably won't try very hard to make that path cheap.
Benchmark suite executing on the following system:
Operating System | Linux |
---|---|
CPU Information | AMD Ryzen 5 7640U w |
Number of Available Cores | 12 |
Available Memory | 14.94 GB |
Elixir Version | 1.15.4 |
Erlang Version | 26.0.2 |
Benchmark suite executing with the following configuration:
:time | 5 s |
---|---|
:parallel | 1 |
:warmup | 2 s |
Run Time
Name | IPS | Average | Deviation | Median | 99th % |
---|---|---|---|---|---|
graphmath_vec3_add | 26.31 K | 38.00 µs | ±51.24% | 28.41 µs | 97.75 µs |
matrex_vec3_add | 13.64 K | 73.34 µs | ±21.57% | 72.37 µs | 107.58 µs |
numerl_vec3_add | 6.40 K | 156.25 µs | ±23.12% | 175.50 µs | 211.31 µs |
nx_vec3_add | 1.26 K | 793.05 µs | ±22.67% | 754.82 µs | 1854.12 µs |
exla_vec3_add | 0.0714 K | 14014.31 µs | ±10.38% | 13743.83 µs | 22279.93 µs |
Run Time Comparison
Name | IPS | Slower |
---|---|---|
graphmath_vec3_add | 26.31 K | |
matrex_vec3_add | 13.64 K | 1.93x |
numerl_vec3_add | 6.40 K | 4.11x |
nx_vec3_add | 1.26 K | 20.87x |
exla_vec3_add | 0.0714 K | 368.77x |
Memory Usage
Name | Average | Factor |
---|---|---|
graphmath_vec3_add | 46.92 KB | |
matrex_vec3_add | 172.01 KB | 3.67x |
numerl_vec3_add | 140.52 KB | 2.99x |
nx_vec3_add | 1446.76 KB | 30.83x |
exla_vec3_add | 6618.74 KB | 141.06x |
Run Time
Name | IPS | Average | Deviation | Median | 99th % |
---|---|---|---|---|---|
graphmath_vec3_cross | 13.98 K | 71.54 µs | ±17.46% | 69.61 µs | 91.90 µs |
matrex_vec3_cross | 10.76 K | 92.96 µs | ±25.88% | 91.08 µs | 177.99 µs |
numerl_vec3_cross | 2.97 K | 337.20 µs | ±20.76% | 324.80 µs | 533.70 µs |
Run Time Comparison
Name | IPS | Slower |
---|---|---|
graphmath_vec3_cross | 13.98 K | |
matrex_vec3_cross | 10.76 K | 1.3x |
numerl_vec3_cross | 2.97 K | 4.71x |
Memory Usage
Name | Average | Factor |
---|---|---|
graphmath_vec3_cross | 46.92 KB | |
matrex_vec3_cross | 250.21 KB | 5.33x |
numerl_vec3_cross | 265.60 KB | 5.66x |
Run Time
Name | IPS | Average | Deviation | Median | 99th % |
---|---|---|---|---|---|
graphmath_vec3_dot | 23.48 K | 42.59 µs | ±20.84% | 40.84 µs | 67.27 µs |
matrex_vec3_dot | 9.29 K | 107.59 µs | ±49.69% | 97.70 µs | 273.42 µs |
numerl_vec3_dot | 3.64 K | 274.99 µs | ±25.39% | 264.57 µs | 624.24 µs |
nx_vec3_dot | 0.189 K | 5289.58 µs | ±14.25% | 4900.31 µs | 8043.57 µs |
exla_vec3_dot | 0.0567 K | 17636.18 µs | ±15.06% | 17257.14 µs | 25854.61 µs |
Run Time Comparison
Name | IPS | Slower |
---|---|---|
graphmath_vec3_dot | 23.48 K | |
matrex_vec3_dot | 9.29 K | 2.53x |
numerl_vec3_dot | 3.64 K | 6.46x |
nx_vec3_dot | 0.189 K | 124.19x |
exla_vec3_dot | 0.0567 K | 414.08x |
Memory Usage
Name | Average | Factor |
---|---|---|
graphmath_vec3_dot | 15.64 KB | |
matrex_vec3_dot | 351.88 KB | 22.5x |
numerl_vec3_dot | 249.79 KB | 15.97x |
nx_vec3_dot | 11308.17 KB | 723.0x |
exla_vec3_dot | 7379.56 KB | 471.82x |
Run Time
Name | IPS | Average | Deviation | Median | 99th % |
---|---|---|---|---|---|
graphmath_vec3_rotate | 4.22 K | 0.24 ms | ±25.27% | 0.21 ms | 0.42 ms |
matrex_vec3_rotate | 0.89 K | 1.12 ms | ±37.38% | 1.02 ms | 3.25 ms |
Run Time Comparison
Name | IPS | Slower |
---|---|---|
graphmath_vec3_rotate | 4.22 K | |
matrex_vec3_rotate | 0.89 K | 4.75x |
Memory Usage
Name | Average | Factor |
---|---|---|
graphmath_vec3_rotate | 0.21 MB | |
matrex_vec3_rotate | 1.90 MB | 8.89x |
Run Time
Name | IPS | Average | Deviation | Median | 99th % |
---|---|---|---|---|---|
matrex_mat4_add | 7.96 K | 125.60 µs | ±21.26% | 123.40 µs | 145.80 µs |
graphmath_mat4_add | 7.56 K | 132.20 µs | ±25.39% | 119.69 µs | 206.39 µs |
numerl_mat4_add | 6.05 K | 165.20 µs | ±27.22% | 178.69 µs | 223.69 µs |
nx_mat4_add | 0.37 K | 2711.80 µs | ±7.42% | 2657.09 µs | 3423.86 µs |
exla_mat4_add | 0.0679 K | 14725.47 µs | ±6.70% | 14674.08 µs | 17505.55 µs |
Run Time Comparison
Name | IPS | Slower |
---|---|---|
matrex_mat4_add | 7.96 K | |
graphmath_mat4_add | 7.56 K | 1.05x |
numerl_mat4_add | 6.05 K | 1.32x |
nx_mat4_add | 0.37 K | 21.59x |
exla_mat4_add | 0.0679 K | 117.24x |
Memory Usage
Name | Average | Factor |
---|---|---|
matrex_mat4_add | 179.82 KB | |
graphmath_mat4_add | 148.45 KB | 0.83x |
numerl_mat4_add | 101.45 KB | 0.56x |
nx_mat4_add | 5012.82 KB | 27.88x |
exla_mat4_add | 6656.80 KB | 37.02x |
Run Time
Name | IPS | Average | Deviation | Median | 99th % |
---|---|---|---|---|---|
matrex_mat4_mul | 9725.33 | 0.103 ms | ±16.67% | 0.101 ms | 0.156 ms |
numerl_mat4_mul | 5989.04 | 0.167 ms | ±20.75% | 0.174 ms | 0.22 ms |
graphmath_mat4_mul | 976.11 | 1.02 ms | ±20.31% | 1.03 ms | 1.48 ms |
nx_mat4_mul | 352.77 | 2.83 ms | ±5.60% | 2.85 ms | 3.25 ms |
ex_mat4_mul | 67.07 | 14.91 ms | ±7.39% | 14.69 ms | 18.76 ms |
Run Time Comparison
Name | IPS | Slower |
---|---|---|
matrex_mat4_mul | 9725.33 | |
numerl_mat4_mul | 5989.04 | 1.62x |
graphmath_mat4_mul | 976.11 | 9.96x |
nx_mat4_mul | 352.77 | 27.57x |
ex_mat4_mul | 67.07 | 144.99x |
Memory Usage
Name | Average | Factor |
---|---|---|
matrex_mat4_mul | 101.62 KB | |
numerl_mat4_mul | 101.45 KB | 1.0x |
graphmath_mat4_mul | 148.45 KB | 1.46x |
nx_mat4_mul | 5012.82 KB | 49.33x |
ex_mat4_mul | 6656.80 KB | 65.51x |
Run Time
Name | IPS | Average | Deviation | Median | 99th % |
---|---|---|---|---|---|
matrex_vm4_mul | 12.84 K | 77.90 µs | ±20.77% | 77.67 µs | 104.72 µs |
graphmath_vm4_mul | 9.11 K | 109.80 µs | ±17.87% | 106.87 µs | 138.93 µs |
numerl_vm4_mul | 5.67 K | 176.40 µs | ±32.50% | 185.72 µs | 266.17 µs |
nx_vm4_mul | 0.23 K | 4416.87 µs | ±13.26% | 4215.49 µs | 7294.22 µs |
exla_vm4_mul | 0.0553 K | 18083.56 µs | ±17.79% | 16789.01 µs | 26136.27 µs |
Run Time Comparison
Name | IPS | Slower |
---|---|---|
matrex_vm4_mul | 12.84 K | |
graphmath_vm4_mul | 9.11 K | 1.41x |
numerl_vm4_mul | 5.67 K | 2.26x |
nx_vm4_mul | 0.23 K | 56.7x |
exla_vm4_mul | 0.0553 K | 232.15x |
Memory Usage
Name | Average | Factor |
---|---|---|
matrex_vm4_mul | 172.01 KB | |
graphmath_vm4_mul | 46.92 KB | 0.27x |
numerl_vm4_mul | 148.33 KB | 0.86x |
nx_vm4_mul | 13224.15 KB | 76.88x |
exla_vm4_mul | 7434.05 KB | 43.22x |
Run Time
Name | IPS | Average | Deviation | Median | 99th % |
---|---|---|---|---|---|
graphmath_mat4_transp | 45.82 K | 21.83 µs | ±74.32% | 12.70 µs | 53.05 µs |
matrex_mat4_transp | 9.24 K | 108.18 µs | ±35.03% | 128.34 µs | 152.43 µs |
numerl_mat4_transp | 8.02 K | 124.65 µs | ±31.31% | 141.40 µs | 169.01 µs |
nx_mat4_transp | 0.53 K | 1883.57 µs | ±6.70% | 1818.93 µs | 2214.72 µs |
exla_mat4_transp | 0.0662 K | 15105.50 µs | ±15.89% | 13847.66 µs | 20533.12 µs |
Run Time Comparison
Name | IPS | Slower |
---|---|---|
graphmath_mat4_transp | 45.82 K | |
matrex_mat4_transp | 9.24 K | 4.96x |
numerl_mat4_transp | 8.02 K | 5.71x |
nx_mat4_transp | 0.53 K | 86.3x |
exla_mat4_transp | 0.0662 K | 692.09x |
Memory Usage
Name | Average | Factor |
---|---|---|
graphmath_mat4_transp | 148.59 KB | |
matrex_mat4_transp | 140.67 KB | 0.95x |
numerl_mat4_transp | 101.45 KB | 0.68x |
nx_mat4_transp | 3863.23 KB | 26.0x |
exla_mat4_transp | 6155.34 KB | 41.43x |