OK, I really, REALLY need to expunge my brain of my bias that says “shelling out to another process is expensive, don’t do that” (which I picked up back in the cgi-bin days). I was not expecting the version of this that changes the Node.js app from running the QR code in JavaScript to using spawn('./qr-cli', ["text here"]) to run TWICE as fast as the non-spawning version, 2572 req/sec compared to 1464 req/sec without that spawn.

Apparently a modern computer can spawn 2500 processes a second without even blinking.
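(For concreteness, a handler along those lines might look roughly like this in Node - a minimal sketch, not the article’s actual code; the ./qr-cli binary, the text query parameter, and the “PNG on stdout” contract are all assumptions here.)

const http = require('http');
const { spawn } = require('child_process');

http.createServer((req, res) => {
  const text = new URL(req.url, 'http://localhost').searchParams.get('text') || 'hello';

  // Hand the CPU-heavy QR encoding to a child process; the event loop only
  // waits on the child's stdout instead of blocking on the computation itself.
  res.writeHead(200, { 'Content-Type': 'image/png' });
  const child = spawn('./qr-cli', [text]);
  child.stdout.pipe(res);
  child.on('error', () => res.destroy()); // e.g. the binary is missing
}).listen(3000);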
I think it just speaks more to how slow the nodejs code is that a pure function is slower than multiple system calls + that same pure function in another language.
This sounds like a textbook case of the single-threaded nature of JS being a hindrance. In this case shelling out moves the work the JS engine does to something that can be efficiently poll(2)-ed, and therefore allows the event loop to continue processing other workloads concurrently. JS is just really not great for computationally intense workloads, nor does it bill itself as such.

Oh that’s interesting - so maybe part of what’s happening here is that using spawn() on a multi-core system is a quick way to get more of those cores involved.

Writing performant Node code means racing to get out of the way of the event loop so it can keep working on small async tasks as much as possible - it’s a lot like avoiding the GIL in Python, where shelling out instead of doing a long GIL-holding task can often be a win.
It’s not really that; it’s probably more that spawning a more efficient program and sending the data over IPC, overhead included, is still faster than doing the same computation in JS. They’re using multiple interpreter instances to parallelize the workload across all cores; in theory you can max out all cores executing JS code.
But the latency is lower too. If it was the same workload merely parallelised, it’d have higher throughput but the same or higher latency. The work is just done quicker.
It is not merely parallelized, it’s not actually any more parallelized than it was before (see other comments): it’s simply more efficient.
The example server used in the article is not a single-threaded NodeJS server implementation:

Regarding the abnormally high memory usage, it’s because I’m running Node.js in “cluster mode”, which spawns 12 processes for each of the 12 CPU cores on my test machine, and each process is a standalone Node.js instance which is why it takes up 1300+ MB of memory even though we have a very simple server. JS is single-threaded so this is what we have to do if we want a Node.js server to make full use of a multi-core CPU.
It’s still single-threaded and what I said still applies, you just have 12 event loops to stall with computationally intensive workloads instead of just one. It mostly boils down to JS just being not a very efficient language.
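(For reference, the “cluster mode” being described boils down to something like this - a minimal sketch with a placeholder port and handler, not the article’s server.)

const cluster = require('cluster');
const http = require('http');
const os = require('os');

if (cluster.isPrimary) {
  // One full Node/V8 process per core - hence the 1300+ MB across 12 workers.
  for (let i = 0; i < os.cpus().length; i++) cluster.fork();
} else {
  // Each worker is still a single-threaded event loop; a CPU-heavy request
  // stalls that worker, it just can't stall its siblings.
  http.createServer((req, res) => res.end('ok\n')).listen(3000);
}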
But the linked document also shows that running a WASM function directly inside JS is even faster than shelling out, and that presumably is just as single-threaded as the plain JS solution?
I think that’s correct, so it mostly amounts to overall efficiency gains. WASM, to my knowledge, shares the thread with the JS runtime so it also stalls the event loop. For computations that are not sufficiently intensive, this is probably a net gain because you save a syscall and only really pay FFI costs (ish, I think WASM serializes arguments across boundaries, but I’m not quite sure).
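(Rough sketch of that in-process WASM path - qr.wasm and its encode export are made-up names, not the article’s module. Numbers cross the JS/WASM boundary directly; strings and buffers have to be copied into the module’s linear memory, and the call runs on the same thread as the event loop.)

const fs = require('fs');

async function loadQr() {
  const bytes = fs.readFileSync('./qr.wasm');            // hypothetical module
  const { instance } = await WebAssembly.instantiate(bytes, {});
  return instance.exports;                               // e.g. an encode() export
}

// Calling exports.encode(...) blocks this event loop for as long as the
// encoding takes - it is faster than plain JS, not more concurrent.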
Yes, starting Unix processes isn’t slow, and CGI isn’t inherently slow. The meme to remember is Richard Hipp who starts a fresh C program on literally every connection to sqlite.org, via inetd:
https://www.mail-archive.com/[email protected]/msg02065.html
(this is 2010, but I actually would bet it’s still the same way. We can probably tell by going through the public repos.)
Quoted in Comments on Scripting, CGI, and FastCGI
So CGI is not inherently slow, but it CAN be slow, because starting Python interpreters and node.js JITs is slow.
Some rules of thumb from Reminiscing CGI Scripts:

fork and exec of a C / Rust / Zig / D program -- ~1 ms (shared libraries can affect this)
awks, shells, Perl 5 -- < 10 ms
Python with no deps -- < 50 ms
Ruby -- a bit slower to start than Python
node.js with no deps -- < 200 ms
Erlang is somewhere in here I think
Python with deps -- 500 ms or more
JVM is as slow, or slower
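(A rough way to sanity-check numbers like these yourself; the commands are illustrative and assume the interpreters are installed, and results vary a lot with machine, disk cache, and installed dependencies.)

const { spawnSync } = require('child_process');

for (const [name, cmd, args] of [
  ['C (/bin/true)', '/bin/true', []],
  ['perl', 'perl', ['-e', '1']],
  ['python3', 'python3', ['-c', 'pass']],
  ['node', process.execPath, ['-e', '0']],
]) {
  const t0 = process.hrtime.bigint();
  spawnSync(cmd, args);                                  // fork + exec + wait
  const ms = Number(process.hrtime.bigint() - t0) / 1e6;
  console.log(name.padEnd(16), ms.toFixed(1), 'ms');
}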
Or to summarize it even more, starting a Python/JS program can be 50x-200x slower than starting a C/Rust/Zig program.
On the other thread we were talking about good latency budgets being 20-50ms per request. Unfortunately most interpreted languages eat this up just by starting, so you can’t use CGI with them efficiently.
That’s why I use FastCGI with Python - because the interpreter starts too slowly.
Shells and Perl are suitable for CGI, but the shell language is too weak and insecure for it. (https://www.oilshell.org changes that – it starts in 2 to 3ms now, but should be less with some optimization.)
Why starting Python interpreters is slow: https://news.ycombinator.com/item?id=7842873

It’s basically doing random access I/O (the slowest thing your computer can do) proportional to (large constant factor) * (num imports in program) * (length of PYTHONPATH).
This is one of the few things that got slower with Python 3, not faster.
Software latency generally gets worse unless you fight it
Same point here too - CGI WTF
Yeah that surprised me too!
When I rewrote my link log https://dotat.at/:/ I reverted from FastCGI to basic CGI, but Rust is fast enough that it can exec(), load tens of thousands of links, and spit out a web page, faster than the old Perl could just spit out a web page.
What is a linklog? Is it a log of webpages that link to your website?
It’s a web log of links without writing (tho I adjust titles for style and searchability)
I think some of that stems from the fact that, back in the heyday of cgi-bin + Perl, those pretty-slow scripts were booting up a fairly slow language interpreter and doing all the dependency imports, for every request.
Also quite possibly opening new database connections, potentially to different machines. Other caches wouldn’t be warm (or you have to explicitly write them to disk). They would also read config files and a variety of other tasks.
Basically starting a process is fast. But many programs take a long time to start.
I think this also applies for spawning threads. Modern CPUs can handle hundreds of thousands of threads without squeaking.
The limits on threads have come from kernels rather than CPUs for a very long time.
I assume you mean “memory” not kernels.
Each thread needs a stack and touches at least a page of it. Those 4k pages add up fast.
If that were the case green threads wouldn’t be so much faster. They still need stacks, even if those stacks don’t live on “the” stack.
For a long time, Linux has been able to fork/exec and wait for roughly one process every half millisecond, single-threaded, with some contention as core count goes up. Static linking helps this. Spawning a no-op thread and joining it takes about 25 microseconds.
9front gets something like ten times better, sitting at about 44 microseconds per fork/exec on the same hardware, but its binary format is also a lot simpler. Spawning a thread (well, shared memory process, without exec) takes about 10 microseconds. We serve the main website with a web server written in shell script (https://git.9front.org/plan9front/plan9front/HEAD/rc/bin/rc-httpd/rc-httpd/f.html), and even with all of the processes spawned, it’s fast enough.
Processes and threads are much faster than most people expect.
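(A crude way to put a number on the process side of that from Node; this times spawnSync of /bin/true, so it is an upper bound on what fork/exec itself costs, and measuring raw thread-spawn cost properly would need C rather than JS.)

const { spawnSync } = require('child_process');

const N = 500;
const t0 = process.hrtime.bigint();
for (let i = 0; i < N; i++) spawnSync('/bin/true');      // fork + exec + wait, N times
const perSpawnMs = Number(process.hrtime.bigint() - t0) / 1e6 / N;
console.log(perSpawnMs.toFixed(2), 'ms per process,', Math.round(1000 / perSpawnMs), 'per second');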
Very interesting! I think 9front/plan9 use programs at something like /net/tcp/80, where the 80 is an executable that is invoked for each new connection, so spawning new processes quickly is critical to the general architecture.

How is 9front’s modern hardware support? I’d love to poke around with it on like a Raspberry Pi 5 or in a QEMU VM.
/bin/service/$proto^$port, but same deal.
Modern hardware support is pretty good, but the pi5 hasn’t had anyone interested enough to port it. A used Thinkpad is cheaper, faster, and comes with a built in screen.
Any idea how hard it would be to port it? I’m interested in doing it if I can figure it out!
Edit: I found the bcm boot stuff. Seems like it wouldn’t be too bad to port (along with downloading the firmware and some light datasheet reading).

Doesn’t Rust always have a nasty first start for the CLI one though? At least every time I’ve rebuilt a Rust binary, the first boot is always +800ms.
My guess is that this is the notarisation check, which involves synchronously waiting for a request to complete the first time you execute an unsigned binary: https://sigpipe.macromates.com/2020/macos-catalina-slow-by-design/
Yep, likely caused by the notarization check if on macOS. On Windows it could be antivirus. On Linux, for example, I see no difference:

$ cargo new --bin rust_hello
    Created binary (application) `rust_hello` package
$ cargo build --release
   Compiling rust_hello v0.1.0 (/home/markwatson/rust_hello)
    Finished release [optimized] target(s) in 0.14s
$ time ./target/release/rust_hello
Hello, world!

real	0m0.017s
user	0m0.013s
sys	0m0.008s
$ time ./target/release/rust_hello
Hello, world!

real	0m0.017s
user	0m0.014s
sys	0m0.008s
(This was running on a slow machine that was running other programs, so while it’s not super realistic, it does seem to illustrate it doesn’t take longer on first launch. I’m too lazy to build a more realistic test.)
It’s really noticeable on macos, even with a tiny C program.
ah yeah that’s gotta be it
Are you including the compile time? There is no inherent boot time for a Rust application. Or at least, no more than C.
Nope, just first execute of binary
That is extremely unusual.
apparently it’s very usual! The notarization check, see above thread
I only use Windows and Linux, and had never heard people complain about it before. Glad you got it figured out though!
There is no expensive runtime there. The compile time of Rust is my least favorite part of Rust for sure. The startup time of binaries is almost zero, although specific programs can still be slow to start if they are doing something expensive on their own.
Are you on Windows? This is something I never experienced, but it does sound a lot like something Defender would do.
mac m3 max
Also building Rust code on a Mac M3. Never seen a “nasty first start”.
If the interpreted language allows for parallel execution when launching the executable in Tier 1 (which, as other comments suggest, may not be the case for JavaScript without workarounds), it will introduce a potential denial-of-service vulnerability via PID exhaustion.
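(One way to illustrate the concern: if every request spawns a child with no cap, a burst of requests becomes a burst of processes. A simple in-process limit - the 64 below is an arbitrary number for illustration - keeps the spawn-per-request pattern from turning into PID exhaustion.)

const { spawn } = require('child_process');

const MAX_CHILDREN = 64;                                 // arbitrary cap for illustration
let inFlight = 0;

function spawnLimited(cmd, args, onBusy) {
  if (inFlight >= MAX_CHILDREN) return onBusy();         // shed load instead of forking
  inFlight++;
  let released = false;
  const release = () => { if (!released) { released = true; inFlight--; } };
  const child = spawn(cmd, args);
  child.on('close', release);
  child.on('error', release);                            // spawn failure also releases the slot
  return child;
}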