Crude benchmarks for various Elixir vector math libs.

There's a fair amount written about crunching big matrices with neural-net algorithms using Nx, but the more I look at that world, the clearer it becomes that it's not what I want. I want to do physics and graphics, which means crunching lots and lots of relatively small vectors and matrices that change rapidly, not offloading giant chonks of numbers to the GPU and waiting for it to get back to you days later. So I'm curious how fast this gamedev-style numerical work is on the Erlang VM, how it's affected by different implementation choices, and how the progress of the BEAM's JIT has changed the picture. Erlang isn't really designed to be good at numerical work; it's designed to be good at putting together and taking apart pointer-heavy data structures. But it's also traditionally a bit faster in general than I expect it to be. I did some very crude benchmarks, was surprised by the results, and here I'm trying to figure out more about what's going on.
These libs also take a variety of implementation approaches: some use plain BEAM data types, some wrap BLAS primitives in NIFs and build the higher-level operations on top of that in Elixir, some wrap higher-level C or Rust code in NIFs, and so on. So I also want to compare those approaches a little.
Disclaimer: These are not good benchmarks. This is a complicated topic with lots of fiddly nuance, and I don't have the patience to go through it in all the detail it would actually take. For example, there are known optimizer bugs in some recent versions of Elixir. I'm just curious about a ballpark view of the landscape. I'm also not going to look at GPU stuff; it adds too many variables.
The test harness used is Benchee.
Building and running:

```sh
# Install math deps
sudo apt install libopenblas-dev liblapacke-dev

# Setup ASDF
git clone https://github.com/asdf-vm/asdf.git ~/.asdf --branch v0.14.0
source ~/.asdf/asdf.fish
asdf plugin add erlang
asdf plugin add elixir
asdf plugin add rebar
asdf install erlang 26.0.2
asdf install elixir 1.15.4-otp-26
asdf install rebar 3.23.0
asdf global rebar 3.23.0
asdf global erlang 26.0.2
asdf global elixir 1.15.4-otp-26

# Actually build the stuff
mix deps.get
iex -S mix

# In iex shell:
ElixirMathbench.run_all()
```
| Lib | Representation | Computation engine |
|-------------|----------------|--------------------|
| Graphmath | BEAM tuples | BEAM |
| Matrex | BEAM binaries | BLAS |
| numerl | BEAM binaries | BLAS |
| Nx (native) | BEAM binaries | BEAM |
| Nx (EXLA) | ? | XLA/CPU |
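For a rough sense of what those representations look like in practice, here's a small sketch; the constructor calls are my best understanding of each library's API, so check their docs before copying this.

```elixir
# Roughly what a small vector looks like in each representation.
graphmath_vec = {1.0, 2.0, 3.0}                 # plain BEAM tuple of floats
matrex_vec    = Matrex.new([[1.0, 2.0, 3.0]])   # binary-backed matrix (row vector)
nx_vec        = Nx.tensor([1.0, 2.0, 3.0])      # Nx tensor; BinaryBackend unless EXLA is configured
```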
| Lib | Performance | Completeness | Ergonomics |
|-------------|-------------|--------------|------------|
| Graphmath | B+ | A- | B |
| Matrex | B | C+ | A- |
| numerl | C | B | B+ |
| Nx (native) | F | C- | A- |
| Nx (EXLA) | F | C- | A- |
The real result of this is the realization that Graphmath is the only lib among them that actually has all the operations relevant for graphics and physics: cross products, quaternion rotations, that sort of thing. And that Benchee is pretty damn slick.
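For instance, the cross product is just a function over plain tuples (a tiny sketch; see Graphmath's docs for the full Vec3/Quatern API):

```elixir
# Cross product of the X and Y basis vectors using Graphmath's tuple-based vec3s.
x_axis = {1.0, 0.0, 0.0}
y_axis = {0.0, 1.0, 0.0}

Graphmath.Vec3.cross(x_axis, y_axis)
#=> {0.0, 0.0, 1.0}
```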
So, if you are doing lots of math to bunches of small vectors, it appears you're almost always best off getting out of the way of the JIT and letting it optimize your math for you. There are a few cases where the native-code libs do significantly better, namely matrix multiplies, so those might be worth investigating further and trying to optimize, but Graphmath's "tuples full of numbers" representation works absurdly well in most cases compared to the overhead of FFI calls to native functions. ...That said, some very rough and stupid playing with Rust and the glam crate looks like it does vec4*mat4 multiplications about 200x faster than Graphmath, which is also pretty impressive.
Possible future work: big matrices, small matrices packed into big ones, runs with and without the JIT, more libs, and comparisons against other languages (mainly numpy and Rust).
Every benchmark is the given operation applied to a list of 1000 values with Enum.map(). That has some overhead, but I tried benchmarking these libs on single operations (lib/elixir_mathbench_single.ex) and got pretty similar results. In a real system there are probably gains to be had by using a denser data structure of some kind or another, let alone things like column-major layouts.
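Concretely, each Benchee entry is shaped roughly like this. This is a minimal sketch of the setup described above, not the actual harness code; the input generation and the memory_time setting are made up for illustration.

```elixir
# Build 1000 pairs of vec3 tuples outside the measured function,
# then measure mapping one operation over the whole list.
pairs =
  for _ <- 1..1000 do
    {{:rand.uniform(), :rand.uniform(), :rand.uniform()},
     {:rand.uniform(), :rand.uniform(), :rand.uniform()}}
  end

Benchee.run(
  %{
    "graphmath_vec3_add" => fn ->
      Enum.map(pairs, fn {a, b} -> Graphmath.Vec3.add(a, b) end)
    end
  },
  warmup: 2,
  time: 5,
  memory_time: 1
)
```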
I have not checked whether any of these libraries actually gives correct results. I guess I trust them to get their basics right.
Creating/destroying matrices is not benchmarked; all the inputs are created outside the benchmark measurement functions. However, most operations still have to allocate a new matrix for their output, which might be part of why things like EXLA are so slow. Something designed to operate on the GPU is going to assume that allocating memory is always slow, and probably won't try very hard to make that path cheap.
Benchmark suite executing on the following system:
Operating System | Linux |
---|---|
CPU Information | AMD Ryzen 5 7640U w |
Number of Available Cores | 12 |
Available Memory | 14.94 GB |
Elixir Version | 1.15.4 |
Erlang Version | 26.0.2 |
Benchmark suite executing with the following configuration:
:time | 5 s |
---|---|
:parallel | 1 |
:warmup | 2 s |
Run Time
Name | IPS | Average | Deviation | Median | 99th % |
---|---|---|---|---|---|
graphmath_vec3_add | 26.31 K | 38.00 µs | ±51.24% | 28.41 µs | 97.75 µs |
matrex_vec3_add | 13.64 K | 73.34 µs | ±21.57% | 72.37 µs | 107.58 µs |
numerl_vec3_add | 6.40 K | 156.25 µs | ±23.12% | 175.50 µs | 211.31 µs |
nx_vec3_add | 1.26 K | 793.05 µs | ±22.67% | 754.82 µs | 1854.12 µs |
exla_vec3_add | 0.0714 K | 14014.31 µs | ±10.38% | 13743.83 µs | 22279.93 µs |
Run Time Comparison
Name | IPS | Slower |
---|---|---|
graphmath_vec3_add | 26.31 K | |
matrex_vec3_add | 13.64 K | 1.93x |
numerl_vec3_add | 6.40 K | 4.11x |
nx_vec3_add | 1.26 K | 20.87x |
exla_vec3_add | 0.0714 K | 368.77x |
Memory Usage
Name | Average | Factor |
---|---|---|
graphmath_vec3_add | 46.92 KB | |
matrex_vec3_add | 172.01 KB | 3.67x |
numerl_vec3_add | 140.52 KB | 2.99x |
nx_vec3_add | 1446.76 KB | 30.83x |
exla_vec3_add | 6618.74 KB | 141.06x |
Run Time
Name | IPS | Average | Deviation | Median | 99th % |
---|---|---|---|---|---|
graphmath_vec3_cross | 13.98 K | 71.54 µs | ±17.46% | 69.61 µs | 91.90 µs |
matrex_vec3_cross | 10.76 K | 92.96 µs | ±25.88% | 91.08 µs | 177.99 µs |
numerl_vec3_cross | 2.97 K | 337.20 µs | ±20.76% | 324.80 µs | 533.70 µs |
Run Time Comparison
Name | IPS | Slower |
---|---|---|
graphmath_vec3_cross | 13.98 K | |
matrex_vec3_cross | 10.76 K | 1.3x |
numerl_vec3_cross | 2.97 K | 4.71x |
Memory Usage
Name | Average | Factor |
---|---|---|
graphmath_vec3_cross | 46.92 KB | |
matrex_vec3_cross | 250.21 KB | 5.33x |
numerl_vec3_cross | 265.60 KB | 5.66x |
Run Time
Name | IPS | Average | Deviation | Median | 99th % |
---|---|---|---|---|---|
graphmath_vec3_dot | 23.48 K | 42.59 µs | ±20.84% | 40.84 µs | 67.27 µs |
matrex_vec3_dot | 9.29 K | 107.59 µs | ±49.69% | 97.70 µs | 273.42 µs |
numerl_vec3_dot | 3.64 K | 274.99 µs | ±25.39% | 264.57 µs | 624.24 µs |
nx_vec3_dot | 0.189 K | 5289.58 µs | ±14.25% | 4900.31 µs | 8043.57 µs |
exla_vec3_dot | 0.0567 K | 17636.18 µs | ±15.06% | 17257.14 µs | 25854.61 µs |
Run Time Comparison
Name | IPS | Slower |
---|---|---|
graphmath_vec3_dot | 23.48 K | |
matrex_vec3_dot | 9.29 K | 2.53x |
numerl_vec3_dot | 3.64 K | 6.46x |
nx_vec3_dot | 0.189 K | 124.19x |
exla_vec3_dot | 0.0567 K | 414.08x |
Memory Usage
Name | Average | Factor |
---|---|---|
graphmath_vec3_dot | 15.64 KB | |
matrex_vec3_dot | 351.88 KB | 22.5x |
numerl_vec3_dot | 249.79 KB | 15.97x |
nx_vec3_dot | 11308.17 KB | 723.0x |
exla_vec3_dot | 7379.56 KB | 471.82x |
Run Time
Name | IPS | Average | Deviation | Median | 99th % |
---|---|---|---|---|---|
graphmath_vec3_rotate | 4.22 K | 0.24 ms | ±25.27% | 0.21 ms | 0.42 ms |
matrex_vec3_rotate | 0.89 K | 1.12 ms | ±37.38% | 1.02 ms | 3.25 ms |
Run Time Comparison
Name | IPS | Slower |
---|---|---|
graphmath_vec3_rotate | 4.22 K | |
matrex_vec3_rotate | 0.89 K | 4.75x |
Memory Usage
Name | Average | Factor |
---|---|---|
graphmath_vec3_rotate | 0.21 MB | |
matrex_vec3_rotate | 1.90 MB | 8.89x |
Run Time
Name | IPS | Average | Deviation | Median | 99th % |
---|---|---|---|---|---|
matrex_mat4_add | 7.96 K | 125.60 µs | ±21.26% | 123.40 µs | 145.80 µs |
graphmath_mat4_add | 7.56 K | 132.20 µs | ±25.39% | 119.69 µs | 206.39 µs |
numerl_mat4_add | 6.05 K | 165.20 µs | ±27.22% | 178.69 µs | 223.69 µs |
nx_mat4_add | 0.37 K | 2711.80 µs | ±7.42% | 2657.09 µs | 3423.86 µs |
exla_mat4_add | 0.0679 K | 14725.47 µs | ±6.70% | 14674.08 µs | 17505.55 µs |
Run Time Comparison
Name | IPS | Slower |
---|---|---|
matrex_mat4_add | 7.96 K | |
graphmath_mat4_add | 7.56 K | 1.05x |
numerl_mat4_add | 6.05 K | 1.32x |
nx_mat4_add | 0.37 K | 21.59x |
exla_mat4_add | 0.0679 K | 117.24x |
Memory Usage
Name | Average | Factor |
---|---|---|
matrex_mat4_add | 179.82 KB | |
graphmath_mat4_add | 148.45 KB | 0.83x |
numerl_mat4_add | 101.45 KB | 0.56x |
nx_mat4_add | 5012.82 KB | 27.88x |
exla_mat4_add | 6656.80 KB | 37.02x |
Run Time
Name | IPS | Average | Deviation | Median | 99th % |
---|---|---|---|---|---|
matrex_mat4_mul | 9725.33 | 0.103 ms | ±16.67% | 0.101 ms | 0.156 ms |
numerl_mat4_mul | 5989.04 | 0.167 ms | ±20.75% | 0.174 ms | 0.22 ms |
graphmath_mat4_mul | 976.11 | 1.02 ms | ±20.31% | 1.03 ms | 1.48 ms |
nx_mat4_mul | 352.77 | 2.83 ms | ±5.60% | 2.85 ms | 3.25 ms |
ex_mat4_mul | 67.07 | 14.91 ms | ±7.39% | 14.69 ms | 18.76 ms |
Run Time Comparison
Name | IPS | Slower |
---|---|---|
matrex_mat4_mul | 9725.33 | |
numerl_mat4_mul | 5989.04 | 1.62x |
graphmath_mat4_mul | 976.11 | 9.96x |
nx_mat4_mul | 352.77 | 27.57x |
ex_mat4_mul | 67.07 | 144.99x |
Memory Usage
Name | Average | Factor |
---|---|---|
matrex_mat4_mul | 101.62 KB | |
numerl_mat4_mul | 101.45 KB | 1.0x |
graphmath_mat4_mul | 148.45 KB | 1.46x |
nx_mat4_mul | 5012.82 KB | 49.33x |
ex_mat4_mul | 6656.80 KB | 65.51x |
Run Time
Name | IPS | Average | Deviation | Median | 99th % |
---|---|---|---|---|---|
matrex_vm4_mul | 12.84 K | 77.90 µs | ±20.77% | 77.67 µs | 104.72 µs |
graphmath_vm4_mul | 9.11 K | 109.80 µs | ±17.87% | 106.87 µs | 138.93 µs |
numerl_vm4_mul | 5.67 K | 176.40 µs | ±32.50% | 185.72 µs | 266.17 µs |
nx_vm4_mul | 0.23 K | 4416.87 µs | ±13.26% | 4215.49 µs | 7294.22 µs |
exla_vm4_mul | 0.0553 K | 18083.56 µs | ±17.79% | 16789.01 µs | 26136.27 µs |
Run Time Comparison
Name | IPS | Slower |
---|---|---|
matrex_vm4_mul | 12.84 K | |
graphmath_vm4_mul | 9.11 K | 1.41x |
numerl_vm4_mul | 5.67 K | 2.26x |
nx_vm4_mul | 0.23 K | 56.7x |
exla_vm4_mul | 0.0553 K | 232.15x |
Memory Usage
Name | Average | Factor |
---|---|---|
matrex_vm4_mul | 172.01 KB | |
graphmath_vm4_mul | 46.92 KB | 0.27x |
numerl_vm4_mul | 148.33 KB | 0.86x |
nx_vm4_mul | 13224.15 KB | 76.88x |
exla_vm4_mul | 7434.05 KB | 43.22x |
Run Time
Name | IPS | Average | Deviation | Median | 99th % |
---|---|---|---|---|---|
graphmath_mat4_transp | 45.82 K | 21.83 µs | ±74.32% | 12.70 µs | 53.05 µs |
matrex_mat4_transp | 9.24 K | 108.18 µs | ±35.03% | 128.34 µs | 152.43 µs |
numerl_mat4_transp | 8.02 K | 124.65 µs | ±31.31% | 141.40 µs | 169.01 µs |
nx_mat4_transp | 0.53 K | 1883.57 µs | ±6.70% | 1818.93 µs | 2214.72 µs |
exla_mat4_transp | 0.0662 K | 15105.50 µs | ±15.89% | 13847.66 µs | 20533.12 µs |
Run Time Comparison
Name | IPS | Slower |
---|---|---|
graphmath_mat4_transp | 45.82 K | |
matrex_mat4_transp | 9.24 K | 4.96x |
numerl_mat4_transp | 8.02 K | 5.71x |
nx_mat4_transp | 0.53 K | 86.3x |
exla_mat4_transp | 0.0662 K | 692.09x |
Memory Usage
Name | Average | Factor |
---|---|---|
graphmath_mat4_transp | 148.59 KB | |
matrex_mat4_transp | 140.67 KB | 0.95x |
numerl_mat4_transp | 101.45 KB | 0.68x |
nx_mat4_transp | 3863.23 KB | 26.0x |
exla_mat4_transp | 6155.34 KB | 41.43x |