
A small adventure in WebAssembly

We've started playing a new board game. An old board game, actually; it seems like it's originally from 1975 and has gone through a few iterations.

221B Baker Street comes with a deck of 75 unique, original Sherlock Holmes mysteries and a dense book of clues. At the start of the game, one mystery is selected at random from the deck and read out to everybody. Players then start at Baker Street on the board and roll dice to travel around the board to other locations, such as the museum, the theatre, Scotland Yard, the docks and the Boar's Head pub. There are 15 locations in all. At each location, there is a unique clue. When a player reaches the location, they open the clue book, secretly read that single clue, take notes, and then put the book back. The objective is to gather enough clues that you think you can solve the mystery... then swiftly return to Baker Street, announce your solution, and check it against the actual solution to see if you win or lose.

In my opinion, this is a really great adaptation of the Sherlock Holmes premise to a new format. The setting and the mysteries are standalone, timeless and accessible in the same way that the original Sherlock Holmes stories are. You get a new story every time and the way to succeed at the game is to invest personally in that story. There's a luck/skill/judgement element — Do you think you have enough information to guess what really happened? Was eight clues enough? Or should you visit one more location, just in case there's something critical you're missing? But what if someone else beats you back to Baker Street? Can you chance a wild guess? Have you checked the facts? Are you sure?

But what I actually want to talk about is of course the travelling salesman problem.

What is the optimal way to start at Baker Street, visit all 15 locations on the board, gathering all 15 clues, and return to Baker Street?

The game is afoot!

*

There are a few wrinkles here.

First: there are two entrances to the docks, and three entrances to the park. After visiting those locations, you may leave via any exit. Looking at the board, you can see that, for example, the optimal route from Baker Street to Scotland Yard is actually through the park.

Second: fast travel. The carriage depot always has a clue, but it also allows you to travel instantly to any other location without rolling dice. If you don't stop to collect the carriage depot clue, you can even go straight to that other location and get that clue in your current turn. As well as further complicating the shortest route calculations, this means that the graph is asymmetric!
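
To make the asymmetry concrete, here is one way the pass-through shortcut could be folded into the distance matrix. This is only a sketch of my reading of the rules above, not the preprocessing the actual implementation uses, and `steps`, `DEPOT` and `withFastTravel` are names I've made up for illustration:

```
// steps[x][y] is the walking distance from x to y; DEPOT is the index of the
// carriage depot. Any journey can be capped at the cost of walking to the
// depot and riding straight on, but the journey *to* the depot cannot.
const withFastTravel = (steps, DEPOT) =>
  steps.map((row, x) => row.map((direct, y) =>
    y === DEPOT ? direct : Math.min(direct, steps[x][DEPOT])
  ))
```

With that applied (and assuming `steps[x][x]` is zero), every distance out of the carriage depot collapses to zero while the distances into it stay as they were, which is exactly the asymmetry.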

Third: dice. Just measuring step count isn't actually what we want here. The game is played with a single D6, which, to be honest, makes navigating the game board quite excruciating. We commonly play with 2D6. And once you arrive at a location, your turn is over; you don't get to keep moving. So we need to optimise for the expected number of turns, which isn't exactly the same thing as step count, especially over shorter distances.
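
For concreteness, here's roughly how the expected number of turns for a given step count can be computed, assuming 2D6 and assuming that all that matters is the step count to your target, with your move simply ending when you reach it. Those assumptions are my own reading of the rules, and the names are just for illustration:

```
// pRoll[r] is the probability of rolling a total of r on 2D6.
const pRoll = []
for (let a = 1; a <= 6; a++) {
  for (let b = 1; b <= 6; b++) {
    pRoll[a + b] = (pRoll[a + b] || 0) + 1 / 36
  }
}

// Expected number of turns to cover `steps` steps, where any roll that reaches
// or passes the target ends the move there:
//   E[0] = 0,  E[d] = 1 + sum over rolls r of P(r) * E[max(0, d - r)]
const expectedTurns = steps => {
  const e = [0]
  for (let d = 1; d <= steps; d++) {
    e[d] = 1
    for (let r = 2; r <= 12; r++) {
      e[d] += pRoll[r] * e[Math.max(0, d - r)]
    }
  }
  return e[steps]
}

console.log(expectedTurns(3))   // ≈ 1.03: only a roll of 2 leaves you short
```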

All of this means that setting up the correct initial distances for TSP is actually quite tricky in its own right.

But after that, we need some code. I wanted an exact solution, and brute force on 16 cities was not practical (even with the start fixed at Baker Street, that's 15! = 1,307,674,368,000 possible routes). So I decided to implement the Held–Karp algorithm.
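
For reference, the core of Held–Karp looks something like the following. This is a deliberately plain sketch of the algorithm, not the implementation linked later in this piece; `d` is the (possibly asymmetric) matrix of travel costs, city 0 is 221B Baker Street, and `len[S][u]` is the length of the shortest path from city 0 through every city in the set S, ending at city u + 1:

```
const heldKarp = d => {
  const n = d.length
  const full = (1 << (n - 1)) - 1          // bitfield over cities 1 to n - 1

  const len = [], prev = []
  for (let S = 0; S <= full; S++) {
    len[S] = new Array(n - 1).fill(Infinity)
    prev[S] = new Array(n - 1).fill(-1)
  }

  // Base case: go straight from city 0 to city u + 1.
  for (let u = 0; u < n - 1; u++) {
    len[1 << u][u] = d[0][u + 1]
  }

  // Build up: best path through S ending at u + 1 extends some best path
  // through S minus u, ending at v + 1.
  for (let S = 1; S <= full; S++) {
    for (let u = 0; u < n - 1; u++) {
      if (!(S & (1 << u))) continue
      const T = S & ~(1 << u)
      if (T === 0) continue                // base case already handled
      for (let v = 0; v < n - 1; v++) {
        if (!(T & (1 << v))) continue
        const candidate = len[T][v] + d[v + 1][u + 1]
        if (candidate < len[S][u]) {
          len[S][u] = candidate
          prev[S][u] = v
        }
      }
    }
  }

  // Close the cycle back to city 0, then walk `prev` backwards to recover it.
  let best = Infinity, bestU = -1
  for (let u = 0; u < n - 1; u++) {
    const candidate = len[full][u] + d[u + 1][0]
    if (candidate < best) {
      best = candidate
      bestU = u
    }
  }

  const cycle = [0]
  let S = full, u = bestU
  while (u !== -1) {
    cycle.unshift(u + 1)
    const v = prev[S][u]
    S &= ~(1 << u)
    u = v
  }
  cycle.unshift(0)

  return { length: best, cycle }
}

console.log(heldKarp([
  [0, 1, 2],
  [1, 0, 3],
  [2, 3, 0]
]))
// { length: 6, cycle: [ 0, 2, 1, 0 ] }
```

The table has (n - 1) * 2^(n - 1) entries and the triple loop is O(2^n * n^2), which is why n = 16 is comfortable where brute force is not.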

This was interestingly challenging. My initial implementation, which I won't be sharing, was extremely rough and ready. But it worked. I got an optimal TSP route for n = 16 in about 6.5 seconds. (221B Baker Street, park, hotel, carriage depot, bank, Scotland Yard, Boar's Head via carriage depot, docks, museum, theatre, tobacconist, pawnbroker, locksmith, newsagent, chemist, 221B Baker Street.)

And at this point my interest in the problem kind of split into two separate interests: interest in the game of 221B Baker Street, and interest in the code.

We still play the game. There are plenty of mysteries still to solve. But I actually don't use the optimal route all that often. It makes the game a little too predictable. Also, it commonly involves ignoring information from the mystery. For example, if an actor has been murdered, forget the park, the theatre is the first place to go. (Although, the fastest route from 221B Baker Street to the theatre is via the park and the carriage depot...) In fact, the navigation mechanic is tedious enough that sometimes we even ignore the dice and just take turns magically hopping around the board from location to location, gathering clues. This speeds things up a lot and makes the game more enjoyable, but it renders this project moot. And, of course, it's usually far more advantageous to terminate one's search for clues before visiting all 15 locations...

But separately from this, I wanted to improve my code. I wanted to turn it into a useful, reusable module, not just a rough one-off script. And especially I wanted to work on its performance.

I tinkered with the JavaScript on and off for a while. But while I was doing this, two different thoughts occurred to me. First: JavaScript is not the programming language you choose for high performance work, unless the performance you're looking for is speed of implementation of new UIs. JavaScript is a high-level interpreted language designed for manipulating the DOM. It's fine for that purpose, it's honestly pretty powerful these days and it's much nicer to use now than it was in, say, 2015. And, on the browser side, despite several challengers, it remains the only game in town. But if you have computation to do, and a free choice of programming languages, JavaScript is not the language you choose.

Second: it had been too long since I last learned a new programming language. I have been using JavaScript for server-side and client-side code in my day job, nearly exclusively, for several years now, and I chose it for this project originally because it was the fastest option for me as the implementer. It was time for something new. Something low-level and compiled, with different paradigms. Something which would make me think, and learn. I'd been wanting to learn some Rust for a long time. I will, one day. For this, though, I picked WebAssembly. Handwritten WebAssembly text.

*

I had never touched any assembly language before this. So, this was hugely educational and very difficult! WebAssembly is a massive gear shift. It involved learning how to think about programming all over again. I spent a lot of time in the MDN documentation re-learning how to do exceedingly basic things like adding two variables together. It is an intentionally incredibly sparse language, possibly even relative to other assembly languages. It's also a much less popular language than JavaScript, especially in its handwritten form, which means that guidance and advice is harder to find online.

Just getting anything to work was challenging. I found debugging my code very difficult. You can hook console.log in, but there are no strings, only numbers! You certainly can't just log a chunk of memory out and try to figure out what's wrong with it. I built a battery of unit tests and re-ran them after even the slightest modification to the code.
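
For what it's worth, the hook itself is just a JavaScript import on the host side. Something like the following, where the "console"/"log" namespace, the file name and the `run` export are placeholder names of my own, not the real module's interface:

```
import { readFile } from 'node:fs/promises'

// The .wat side would declare something like
//   (import "console" "log" (func $log (param i32)))
// and call $log with whatever number it wants to inspect.
const importObject = {
  console: {
    log: n => console.log('wasm:', n)
  }
}

// 'held-karp.wasm' and the 'run' export are placeholders for illustration.
const bytes = await readFile(new URL('./held-karp.wasm', import.meta.url))
const { instance } = await WebAssembly.instantiate(bytes, importObject)
instance.exports.run()
```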

Thinking hard about performance also became challenging. I actually ported a few learnings from my WebAssembly implementation back to the JavaScript. Most notably, in JavaScript I had been using two 2D (rectangular) arrays len[S][u] and prev[S][u] to store intermediate results. Here, S is a set of cities in the form of an integer bitfield, and u is a city number from 0 to n - 2 inclusive. But WebAssembly of course has no conception of 2D arrays (or even 1D arrays), only contiguous blocks of memory with computed offsets. A major performance jump in my JavaScript implementation came when I started using two large 1D arrays with the same computed offsets: len[(n - 1) * S + u] and prev[(n - 1) * S + u].
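
Concretely, the change was something like this (a sketch of the shape of it, not the code from the package):

```
const n = 16
const subsets = 1 << (n - 1)              // every possible set S of cities 1 to n - 1

// Before: one small inner array per subset, scattered all over the heap.
//   len[S] = Array(n - 1); len[S][u] = ...

// After: one big contiguous array per table, with a computed offset. This
// mirrors what the WebAssembly is forced to do anyway with its flat block of
// linear memory.
const len = new Array((n - 1) * subsets).fill(Infinity)
const prev = new Array((n - 1) * subsets).fill(-1)

const at = (S, u) => (n - 1) * S + u      // len[S][u] becomes len[at(S, u)]

len[at(0b1010, 3)] = 42
console.log(len[at(0b1010, 3)])           // 42
```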

I left the less performance-intensive parts of the algorithm as relatively unmodified JavaScript. The main purpose of the WebAssembly is to write a large amount of intermediate results out to memory at high speed. Once this is done, a simple O(n) loop written in JavaScript reads the optimal cycle out from these results — this was not worth attempting to optimise.

My current implementation solves the 221B Baker Street problem, with 16 cities, in around 32.5 milliseconds, or about 200 times faster than my first attempt.

Overall, I had a really good time. It was nice to develop some new programming muscles. You can find both my JavaScript and WebAssembly implementations of Held–Karp here, or install them as an npm package using npm install held-karp.

But there is kind of a mildly surprising ending here.

*

WebAssembly makes more efficient use of memory, and in the same amount of memory it can compute TSP on one more city than the JavaScript can. This is as I expected.

But, somehow, it is also a few percent slower than the JavaScript?

At least, it is on my machine. I would be very curious to hear what other people find.

This was actually quite disappointing. I was expecting a pretty substantial performance improvement. That was half of the point of doing this!

There are several possible explanations for this. My WebAssembly is poorly optimised? Seems probable, but I would have hoped that even inefficient WebAssembly would be at least an order of magnitude faster than mere JavaScript. And I would love to learn more about making it faster.

WebAssembly is generally not as efficient as I thought? I heard it was meant to have near-native performance.

Or maybe... Node.js/V8/TurboFan are much more efficient for pure computation tasks than we give them credit for? I said JavaScript isn't a language you choose for performance work. This was common wisdom for a long time. Is it actually still true? What's going on here?

Discussion (7)

2024-11-17 18:08:19 by Simon:

In hindsight the ending may not be that surprising. This kind of code (a small-ish function with tight loops doing many iterations) is the ideal case for a JIT, and V8 has received a lot of optimization work over the years. I suspect that with some tweaks it would be possible to get a JS JIT and a wasm AOT compiler to emit almost the same machine code.

2024-11-17 20:42:51 by mhyee:

What Simon said. There has been a lot of investment from Google, Apple, and Mozilla over the years to make their JavaScript implementations very, very good.

Let's take a step back and ask why JavaScript has its reputation for being slow. If we look at an expression like `x + y`, the best case is that we're adding two machine integers, which is just a handful of machine code instructions, maybe even a single instruction. But in JavaScript, we have to assume the worst case. x and y are not machine integers; they're JavaScript values. So when `x + y` in JavaScript is executed, the machine code that runs has to check whether x is an integer or a string, check whether y is an integer or a string, extract the actual integer from the boxed value, check that + hasn't been overridden, perform the addition, allocate a new JavaScript value on the heap, and then place the result in that value. So a single "addition" has exploded into many operations that need to happen, which include branching (not good for instruction pipelining or instruction caches) and memory allocation. And more instructions are also harder to fit into your processor's instruction cache. (The same problems crop up in other dynamic languages like Python and Ruby. But those languages are not as prevalent as JavaScript, so there has been less industry pressure to optimize them. And Python mostly sidesteps this problem because all the performance-critical code uses C libraries, e.g. NumPy.) As you can imagine, a lot of these problems just go away if you write directly in WebAssembly.

So, how does V8 do so well? It's a just-in-time compiler, so it can observe and profile the code as it executes, and then emit optimized code. So if there's a loop that looks something like:

    function f(x, y) { return x + y; }
    for (let i = 0; i < 1000000; i++) { sum[i] = f(x, y); }

The function f gets called a million times. Normally, the code for f would have all the worst-case handling: branches to check the types of x and y, unboxing and boxing, and so on. But here, V8 observes that f is always called with integer values, so it can perform speculative optimization: it assumes that future calls will also use integer values, so it emits a version of f that is optimized for integer arguments. And since the arguments are known to be integers, all those checks and unboxing/boxing operations can go away. (I'm handwaving a bit here: it's called speculative optimization because it's based on an assumption. V8 still needs to check that the assumption is valid. If not, it will deoptimize and return to the older, unoptimized version of f. This check is still much faster than going through the unoptimized function every time.)

Numeric code in tight loops and small functions is the best case scenario for V8 to reason about and optimize. The code is small and predictable, and the values are all numeric and won't suddenly change to strings or objects. Many iterations means V8 has more time to observe the code and then benefit from the optimized implementation. (Aside: there is an old paper, from 2011, Richards et al., "Automated construction of JavaScript benchmarks", which built JavaScript benchmarks based on "real world" JavaScript code from the most visited webpages, and found that most web browsers did not perform well, because real world code is complicated and unpredictable. Yet the JavaScript engines performed quite well on the then-standard JavaScript benchmarks, which were mostly simple and based on numeric computation!)

Another thing. I suspect the reason why `len[S][u]` is incredibly slow in JavaScript is that JavaScript doesn't actually have arrays; they're actually dictionaries. So `len[S][u]` is a dictionary of dictionaries, which is going to involve multiple (slow) dictionary lookups and poor memory locality! Switching to `len[(n - 1) * S + u]` removes a level of indirection, which allows the optimizer to turn it into a real array (since all the dictionary keys are contiguous integers). So now we have an array in contiguous memory, which is good for memory locality, caching, and prefetching!

One last thing. If the code is easily optimizable, I'd bet on an optimizing compiler over handwritten code. So I'm not that surprised to see JavaScript beat out WebAssembly, in the same way I wouldn't be surprised to see an optimizing C compiler generate better code than handwritten x86.

2024-11-17 21:11:21 by qntm:

On the topic of `len[S][u]`, note that both `S` and `u` are integers, so both `len` and `len[S]` should be simple arrays, with no laborious dictionary lookups. The performance gain probably comes from avoiding a level of indirection and not scattering so many small arrays all over memory.

2024-11-18 04:14:45 by matmarex:

I enjoyed reading about the game and the algorithm! I hate to bear the bad news, but your JS implementation can actually be almost twice as fast as your WASM implementation, not just a couple of percent faster, with just two tweaks: avoiding potentially undefined variables and using typed arrays. I sent you a patch instead of trying to fit the code in this text box: https://github.com/qntm/held-karp/pull/3 (I'm not expecting it to be merged, it's just easier to demonstrate this way). I don't know enough about the internals of the engine to say why this is the case, and I know almost nothing about WASM.

You also mentioned that 2D arrays were slow for you. I don't have that version of your code to test, but I would guess that pre-allocating arrays of the right size (not even typed arrays, even just `new Array(1234)`) would make it noticeably faster. The slow part was probably interrupting your tight loop to allocate a bigger array.

2024-11-18 10:01:59 by qntm:

Can't believe I blanked on using a `Float64Array` instead of a regular old `Array`, good shout. Was very surprised to see that just initialising `bestU` and `bestL` to 0 did anything, but I can't deny that it did. Thank you! This was cool. I suspect we may now have duelling implementations, because some optimisations for the WebAssembly have also been suggested... Back when my implementation used 2D arrays I was indeed using proper sizes for the *inner* arrays, e.g. `len[S] = Array(n - 1); prev[S] = Array(n - 1)`, and this did indeed make them faster. However, I found that doing this for `len` and `prev` themselves actually made things slower? Same when I changed `len` and `prev` to be 1D arrays. Shrug emoji?

2024-11-18 12:32:24 by Hansel:

JavaScript engines perform well, particularly on algorithms where the JIT compiler can determine that the number of objects and array sizes are constant. As a result, WebAssembly might not offer a significant speed advantage in such cases. Moreover, Rust doesn't generate highly optimized code when compiled to WebAssembly, especially when memory allocations are involved. For better performance, Zig and C or C++ tend to be more efficient, producing smaller WebAssembly binaries as well. Using Rust for web applications can feel like using a sledgehammer to crack a nut—it works, but it may not be the most optimal solution.

2024-11-18 18:05:53 by matmarex:

> However, I found that doing this for `len` and `prev` actually made things slower? Same when I changed `len` and `prev` to be 1D arrays. Shrug emoji?

Interesting. I guess if you're accessing the array elements linearly, and this algorithm mostly does, then pre-allocating doesn't have that much of a benefit, and it must have some drawback I don't quite understand. Maybe something doesn't like it when we allocate too big of a chunk all at once? Anyway.
