1. 38
  1.  

    1. 39

      OK, I really, REALLY need to expunge my brain of my bias that says “shelling out to another process is expensive, don’t do that” (which I picked up back in the cgi-bin days). I was not expecting that changing the Node.js app from generating the QR code in JavaScript to calling spawn('./qr-cli', ["text here"]) would make it run TWICE as fast: 2572 req/sec, compared to 1464 req/sec without the spawn.

      Apparently a modern computer can spawn 2500 processes a second without even blinking.
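
      Roughly what that pattern looks like, as a minimal sketch (the ./qr-cli binary, the URL parameter, and the port are placeholders for illustration, not the article's actual code):

      // Minimal sketch: handle each request by spawning an external QR encoder
      // instead of generating the image in JavaScript.
      const http = require('node:http');
      const { spawn } = require('node:child_process');

      http.createServer((req, res) => {
        const text = new URL(req.url, 'http://localhost').searchParams.get('text') ?? 'hello';
        const child = spawn('./qr-cli', [text]);   // './qr-cli' is a hypothetical stand-in
        res.writeHead(200, { 'Content-Type': 'image/png' });
        child.stdout.pipe(res);                    // stream the child's output straight to the client
        child.on('error', () => res.end());
      }).listen(3000);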

      1. 42

        I think it just speaks more to how slow the nodejs code is that a pure function is slower than multiple system calls + that same pure function in another language.

      2. 18

        This sounds like a textbook case of the single-threaded nature of JS being a hindrance. In this case shelling out moves the work the JS engine does to something that can be efficiently poll(2)-ed and therefore allow the event loop to continue processing other workloads concurrently. JS is just really not great for computationally intense workloads, nor does it bill itself as such.

        1. 9

          Oh that’s interesting - so maybe part of what’s happening here is that using spawn() on a multi-core system is a quick way to get more of those cores involved.

          1. 7

            Writing performant node code means racing to get out of the way of the event loop so it can keep working on small async tasks as much as possible - it’s a lot like avoiding the GIL in python, where shelling out instead of doing a long GIL-holding task can often be a win.

          2. 4

            It’s not really that; it’s probably more that spawning a more efficient program and sending the data over IPC, overhead included, is still faster than doing the same computation in JS. They’re using multiple interpreter instances to parallelize the workload across all cores; in theory you can max out all cores executing JS code.

        2. 5

          But the latency is lower too. If it was the same workload merely parallelised, it’d have higher throughput but the same or higher latency. The work is just done quicker.

          1. 1

            It is not merely parallelized, it’s not actually any more parallelized than it was before (see other comments): it’s simply more efficient.

        3. 3

          The example server used in the article is not a single-threaded NodeJS server implementation:

          Regarding the abnormally high memory usage, it’s because I’m running Node.js in “cluster mode”, which spawns 12 processes for each of the 12 CPU cores on my test machine, and each process is a standalone Node.js instance which is why it takes up 1300+ MB of memory even though we have a very simple server. JS is single-threaded so this is what we have to do if we want a Node.js server to make full use of a multi-core CPU.
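
          For reference, “cluster mode” looks roughly like this (a minimal sketch using node:cluster, not the article's actual server code):

          // Minimal sketch of Node's cluster mode: the primary forks one worker per
          // CPU core, and each worker is a full Node.js instance with its own event
          // loop, which is where the 12 processes / 1300+ MB figure comes from.
          const cluster = require('node:cluster');
          const http = require('node:http');
          const os = require('node:os');

          if (cluster.isPrimary) {
            for (let i = 0; i < os.cpus().length; i++) cluster.fork();
          } else {
            // Workers share the same listening port via the primary.
            http.createServer((req, res) => res.end('hello')).listen(3000);
          }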

          1. 4

            It’s still single-threaded and what I said still applies, you just have 12 event loops to stall with computationally intensive workloads instead of just one. It mostly boils down to JS just being not a very efficient language.

        4. 2

          This sounds like a textbook case of the single-threaded nature of JS being a hindrance. In this case shelling out moves the work the JS engine does to something that can be efficiently poll(2)-ed and therefore allow the event loop to continue processing other workloads concurrently. JS is just really not great for computationally intense workloads, nor does it bill itself as such.

          But the linked document also shows that running a WASM function directly inside JS is even faster than shelling out, and that presumably is just as single-threaded as the plain JS solution?

          1. 2

            I think that’s correct, so it mostly amounts to overall efficiency gains. WASM, to my knowledge, shares the thread with the JS runtime, so it also stalls the event loop. For computations that are not sufficiently intensive, this is probably a net gain because you save a syscall and only really pay FFI costs (ish, I think WASM serializes arguments across boundaries, but I’m not quite sure).
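
            For comparison, the WASM path looks roughly like this (a sketch; qr.wasm and the exported generate function are hypothetical, not the article's module):

            const fs = require('node:fs');

            // Compile and instantiate once at startup (qr.wasm is a hypothetical module
            // with no imports; a real one may need an import object).
            const wasmModule = new WebAssembly.Module(fs.readFileSync('./qr.wasm'));
            const { exports } = new WebAssembly.Instance(wasmModule, {});

            // Each call runs synchronously on the same thread as the event loop:
            // no new process, no syscall, but the loop is stalled while it computes.
            // Passing the input text would mean copying it into the module's linear
            // memory first, which is part of the FFI cost mentioned above.
            const result = exports.generate();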

      3. 9

        Yes, starting Unix processes isn’t slow, and CGI isn’t inherently slow. The meme to remember is Richard Hipp who starts a fresh C program on literally every connection to sqlite.org, via inetd:

        https://www.mail-archive.com/[email protected]/msg02065.html

        For each inbound HTTP request, a new process is created which runs the program implemented by the C file shown above.

        (this is 2010, but I actually would bet it’s still the same way. We can probably tell by going through the public repos.)

        Quoted in Comments on Scripting, CGI, and FastCGI


        So CGI is not inherently slow, but it CAN be slow, because starting Python interpreters and node.js JITs is slow.

        Some rules of thumb from Reminiscing CGI Scripts:

        fork and exec of a C / Rust / Zig / D program – ~1 ms
          (shared libraries can affect this)
        awks, shells, Perl 5 – < 10 ms
        Python with no deps – < 50 ms
        Ruby – a bit slower to start than Python
        node.js with no deps – < 200 ms
            Erlang is somewhere in here I think
        Python with deps – 500 ms or more
            JVM is as slow, or slower
        

        Or to summarize it even more, starting a Python/JS program can be 50x-200x slower than starting a C/Rust/Zig program.
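
        If you want to sanity-check these on your own machine, here's a quick sketch (in Node; it assumes perl, python3, and node are on your PATH, and results vary a lot by machine):

        const { spawnSync } = require('node:child_process');

        // Time a full spawn + run + exit of each interpreter.
        for (const [name, cmd, args] of [
          ['C (/bin/true)', '/bin/true', []],
          ['perl', 'perl', ['-e', '1']],
          ['python3', 'python3', ['-c', 'pass']],
          ['node', 'node', ['-e', '0']],
        ]) {
          const t0 = process.hrtime.bigint();
          spawnSync(cmd, args);
          const ms = Number(process.hrtime.bigint() - t0) / 1e6;
          console.log(`${name}: ${ms.toFixed(1)} ms`);
        }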

        On the other thread we were talking about good latency budgets being 20-50 ms per request. Unfortunately most interpreted languages eat this up just by starting, so you can’t use CGI with them efficiently.

        That’s why I use FastCGI with Python - because the interpreter starts too slowly.

        Shells and Perl start fast enough for CGI, but the shell language is too weak and insecure for it. (https://www.oilshell.org changes that – it starts in 2 to 3 ms now, but should be less with some optimization.)


        Why starting Python interpreters is slow: https://news.ycombinator.com/item?id=7842873

        It’s basically doing random access I/O (the slowest thing your computer can do) proportional to (large constant factor) * (num imports in program) * (length of PYTHONPATH).

        This is one of the few things that got slower with Python 3, not faster.

        Software latency generally gets worse unless you fight it

        1. 2

          Same point here too - CGI WTF

      4. 7

        Yeah that surprised me too!

        When I rewrote my link log https://dotat.at/:/ I reverted from FastCGI to basic CGI, but Rust is fast enough that it can exec(), load tens of thousands of links, and spit out a web page, faster than the old Perl could just spit out a web page.

        1. 1

          What is a linklog? Is it a log of webpages that link to your website?

          1. 2

            It’s a web log of links without writing (tho I adjust titles for style and searchability)

      5. 7

        I think some of that stems from the fact that, back in the heyday of cgi-bin + Perl, those pretty-slow scripts were booting up a fairly slow language interpreter and doing all the dependency imports, for every request.

        1. 1

          Also quite possibly opening new database connections, potentially to different machines. Other caches wouldn’t be warm (or you have to explicitly write them to disk). They would also read config files and a variety of other tasks.

          Basically starting a process is fast. But many programs take a long time to start.

      6. 4

        I think this also applies for spawning threads. Modern CPUs can handle hundreds of thousands of threads without squeaking.

        1. 3

          The limits on threads have come from kernels rather than CPUs for a very long time.

          1. 2

            I assume you mean “memory” not kernels.

            Each thread needs a stack and touches at least a page of it. Those 4k pages add up fast.

            1. 1

              If that were the case green threads wouldn’t be so much faster. They still need stacks, even if those stacks don’t live on “the” stack.

      7. 2

        For a long time, Linux has been able to fork, exec, and wait for a process in something like half a millisecond on a single thread, with some contention as core count goes up. Static linking helps this. Spawning a no-op thread and joining it takes about 25 microseconds.

        9front gets something like ten times better, sitting at about 44 microseconds per fork/exec on the same hardware, but its binary format is also a lot simpler. Spawning a thread (well, shared memory process, without exec) takes about 10 microseconds. We serve the main website with a web server written in shell script (https://git.9front.org/plan9front/plan9front/HEAD/rc/bin/rc-httpd/rc-httpd/f.html), and even with all of the processes spawned, it’s fast enough.

        Processes and threads are much faster than most people expect.

        1. 2

          Very interesting! I think 9front/plan9 use programs at something like /net/tcp/80 where the 80 is an executable that is invoked for each new connection, so spawning new processes quickly is critical to the general architecture.

          How is 9front’s modern hardware support? I’d love to poke around with it on something like a Raspberry Pi 5 or in a QEMU VM.

          1. 2

            /bin/service/$proto^$port, but same deal.

            Modern hardware support is pretty good, but the pi5 hasn’t had anyone interested enough to port it. A used Thinkpad is cheaper, faster, and comes with a built in screen.

            1. 1

              Any idea how hard it would be to port it? I’m interested in doing it if I can figure it out!

              Edit: I found the bcm boot stuff. Seems like it wouldn’t be too bad to port (along with downloading the firmware and some light datasheet reading).

    2. 2

      Doesn’t Rust always have a nasty first start for the CLI one though? At least every time I’ve rebuilt a Rust binary, the first boot is always +800 ms.

      1. 24

        My guess is that this is the notarisation check, which involves synchronously waiting for a request to complete the first time you execute an unsigned binary: https://sigpipe.macromates.com/2020/macos-catalina-slow-by-design/

        1. 8

          Yep, likely caused by the notarization check if on macOS. On Windows it could be antivirus. On Linux, for example, I see no difference:

          $ cargo new --bin rust_hello
               Created binary (application) `rust_hello` package
          
          $ cargo build --release
             Compiling rust_hello v0.1.0 (/home/markwatson/rust_hello)
              Finished release [optimized] target(s) in 0.14s
          
          $ time ./target/release/rust_hello
          Hello, world!
          
          real	0m0.017s
          user	0m0.013s
          sys	0m0.008s
          
          $ time ./target/release/rust_hello
          Hello, world!
          
          real	0m0.017s
          user	0m0.014s
          sys	0m0.008s
          

          (This was running on a slow machine that was running other programs, so while it’s not super realistic, it does seem to illustrate that it doesn’t take longer on first launch. I’m too lazy to build a more realistic test.)

          1. 4

            It’s really noticeable on macos, even with a tiny C program.

            $ cat welp.c
            #include <stdio.h>
            #include <string.h>
            #include <sys/errno.h>
            
            int main(int argc, char** argv) {
              for (int i = 0; i < 256; i++) {
                char* err = strerror(i);
                printf("%s\n", err);
              }
            }
            $ clang -o welp welp.c
            $ time ./welp > /dev/null
            
            ________________________________________________________
            Executed in  160.93 millis    fish           external
               usr time    1.37 millis   65.00 micros    1.31 millis
               sys time    3.59 millis  908.00 micros    2.68 millis
            
            $ time ./welp > /dev/null
            
            ________________________________________________________
            Executed in    5.43 millis    fish           external
               usr time    1.49 millis    0.20 millis    1.29 millis
               sys time    3.13 millis    1.18 millis    1.96 millis
            
        2. 1

          ah yeah that’s gotta be it

      2. 11

        Are you including the compile time? There is no inherent boot time for a Rust application. Or at least, no more than C.

        1. 2

          Nope, just first execute of binary

          1. 8

            That is extremely unusual.

            1. 1

              apparently it’s very usual! The notarization check, see above thread

              1. 1

                I only use Windows and Linux, and had never heard people complain about it before. Glad you got it figured out though!

          2. 5

            There is no expensive runtime there. The compile time of Rust is my least favorite part of Rust for sure. The startup time of binaries is almost zero, although specific programs can still be slow to start if they are doing something expensive on their own.

          3. [Comment removed by author]

      3. 3

        Are you on Windows? This is something I never experienced, but it does sound a lot like something Defender would do.

        1. 1

          mac m3 max

          1. 2

            Also building Rust code on a Mac M3. Never seen a “nasty first start”.

      4. [Comment removed by author]

    3. 1

      If the interpreted language allows for parallel execution when launching the executable in Tier 1 (which may not be the case for JavaScript without workarounds, as other comments suggest), it introduces a potential denial-of-service vulnerability through PID exhaustion.
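
      One common mitigation, sketched in Node (the cap of 64 is an arbitrary assumption, not from the article): bound how many children can be alive at once and make the rest wait for a slot.

      const { spawn } = require('node:child_process');

      const MAX_CHILDREN = 64;   // arbitrary assumption; tune against your PID/ulimit budget
      let running = 0;
      const waiting = [];

      async function spawnLimited(cmd, args) {
        while (running >= MAX_CHILDREN) {
          await new Promise((wake) => waiting.push(wake));  // wait for a slot to free up
        }
        running++;
        const child = spawn(cmd, args);
        let released = false;
        const release = () => {
          if (released) return;
          released = true;
          running--;
          const wake = waiting.shift();
          if (wake) wake();                                 // hand the freed slot to the next waiter
        };
        child.once('exit', release);
        child.once('error', release);                       // a spawn failure also frees the slot
        return child;
      }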