I don’t think threads are expensive because we’re all working on async instead. (For one thing, people who work on async I/O in libraries and languages are probably a disjoint set from people who work on threading in OS kernels.)
I’m not a kernel engineer, but my understanding is that
Creating a thread is expensive because it requires a system call, which is because threads are managed by the kernel — otherwise issuing a blocking system call wouldn’t let other threads keep running.
Threads require dedicated stack space, which isn’t growable, so a healthy chunk of address space has to be reserved for each one. (Probably not as big a deal as it used to be in 32-bit.)
You can get around the first with userspace or “green” threads, but now you’ve got all the same problems that async I/O libraries have to solve: allowing your threads to make blocking syscalls without blocking other green threads. In fact what this post’s author is describing sounds exactly like textbook green threads.
You can get around the second problem either by playing weird games with the stack (as Go does) or by using stackless coroutines under the hood (as async/await does.)
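A quick, unscientific way to put a number on that first point is to time thread creation yourself; a minimal sketch in Rust (the figures will vary a lot by OS and hardware) might look like:

```rust
use std::thread;
use std::time::Instant;

fn main() {
    // Rough-and-ready: time how long it takes to create and join one
    // empty OS thread, averaged over many iterations.
    let n: u32 = 10_000;
    let start = Instant::now();
    for _ in 0..n {
        thread::spawn(|| {}).join().unwrap();
    }
    let elapsed = start.elapsed();
    println!("spawn+join x{n}: {:?} total, ~{:?} per thread", elapsed, elapsed / n);
}
```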
Classic threads move the context around, presumably in and out of the stack. This is a well-known cost that is avoided by green threads. Of course, this will require a whole different set of I/O primitives from the top of your code all the way down to the system calls.
Which is to say, you cannot take a program and just replace threads with green threads. You need to change all your I/O calls.
And of course, be a bit careful, because the sharing-vs-copying model is not the same.
I think the answer to this post’s title is ‘yes’ and I think the author is on the right path. Finally someone is talking about the abstractions that we want as programmers and how they could be implemented under the hood.
A bit strange that the author mentions Erlang in the first half of the article, but then ignores it completely when talking about a possible solution, given that Erlang does exactly what is being suggested.
Async/await semantics are, in my opinion, a very bad abstraction. The reason being that they are simply intricate and don’t allow straightforward, composable code. Recent languages come up with all sorts of tricks to mitigate what is essentially bad API design at its very definition.
If you read the paper by Dijkstra where he introduces the P() and V() primitives, you’ll realize we don’t need any more complicated concurrency APIs. Designing APIs as a by-product of the underlying technology is simply bad engineering. It negates the whole notion of programming languages, which is to provide simpler mental models for the complexity that lives underneath.
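For anyone who hasn’t read it: P() and V() are just acquire/release on a counting semaphore. A minimal sketch in Rust, building one out of a Mutex and a Condvar (the names here are mine, not Dijkstra’s), might look like:

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

// A counting semaphore in the spirit of Dijkstra's P()/V().
struct Semaphore {
    count: Mutex<usize>,
    cv: Condvar,
}

impl Semaphore {
    fn new(n: usize) -> Self {
        Semaphore { count: Mutex::new(n), cv: Condvar::new() }
    }
    // P(): wait until the count is positive, then decrement it.
    fn p(&self) {
        let mut count = self.count.lock().unwrap();
        while *count == 0 {
            count = self.cv.wait(count).unwrap();
        }
        *count -= 1;
    }
    // V(): increment the count and wake one waiter.
    fn v(&self) {
        *self.count.lock().unwrap() += 1;
        self.cv.notify_one();
    }
}

fn main() {
    // Allow at most two of the five workers into the critical section at once.
    let sem = Arc::new(Semaphore::new(2));
    let workers: Vec<_> = (0..5)
        .map(|i| {
            let sem = Arc::clone(&sem);
            thread::spawn(move || {
                sem.p();
                println!("worker {i} in the critical section");
                sem.v();
            })
        })
        .collect();
    for w in workers {
        w.join().unwrap();
    }
}
```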
I once saw a talk by Joe Armstrong (RIP), one of Erlang’s creators, showing his disbelief in things like node.js: “I don’t understand, run-to-completion semantics were popular decades ago, until everyone realized they were a horrible idea”. Why are we bringing them back in the 21st century?
AFAIK Erlang just uses green threads. It has some nice stuff around managing those threads and messaging between them (i.e. actors) but as far as threading goes it’s not much different than, say, a JVM or Go, right?
The thing I like about async/await is that it’s coroutines. The thing I like about coroutines is coop scheduling: you can write concurrent non-blocking code without having to deal with the extremes of thread-safety in a SMP system, which are IMHO too hard for mere humans to reason about.
AFAIK Erlang just uses green threads. It has some nice stuff around managing those threads and messaging between them (i.e. actors) but as far as threading goes it’s not much different than, say, a JVM or Go, right?
Green processes. No, it’s completely different. Processes are completely self-contained and it’s impossible for a process to side-effect another one. An Erlang process takes 20 bytes of memory and is essentially free resource-wise. The processes are also abstractable from the machine where they are executed. The same code can be deployed on a single machine or a cluster.
Check it out. Try it! Elixir is probably a more familiar approach to these very concepts for most people these days. Elixir’s documentation of the concurrency model is top quality.
The thing I like about async/await is that it’s coroutines. The thing I like about coroutines is coop scheduling: you can write concurrent non-blocking code without having to deal with the extremes of thread-safety in a SMP system, which are IMHO too hard for mere humans to reason about.
A lot of low-level stuff leaking into what could be an otherwise straightforward API for the developer to consume. Writing non-blocking code is not a plus, it’s a negative. It negates the very concept of a sequential set of instructions, i.e. a script, the source code for a program. We are forced to do it when we don’t have another option. Blocking instructions are the canonical form of source code. Honestly, just give it two hours or so and skim through Elixir’s documentation till you get to the concurrency part.
I find it so ridiculously simple compared to async I/O in most languages, it’s even comical.
I know that stuff, I’ve used Erlang a bit and even ported its runtime to iOS once. The stuff about heaps and side effects is not relevant to a discussion about low level implementation of concurrency.
The stuff about shared-nothing between processes is a double edged sword, btw. Early in my career at Couchbase we had to abandon CouchDB because it wasn’t fast enough, and a lot of that was the overhead of copying data between processes. (The interpreter was also, at the time pretty slow, without any sort of JIT. Yes, Erlang is super scalable, but each process isn’t very fast.)
but as far as threading goes it’s not much different than, say, a JVM or Go, right?
One difference, among others, is that scheduling of processes (barring NIFs) is preemptible. That isn’t true in the JVM, where CPU-heavy work can just steal a thread for a while (or forever).
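Erlang/BEAM is pretty interesting and different.
https://mudssrali.com/blog/breakdown-of-scheduling-in-erlang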
One difference, among others, is that scheduling of processes (barring NIFs) is preemptible. That isn’t true in the JVM, where CPU-heavy work can just steal a thread for a while (or forever).
That’s an implementation detail of the runtime; AFAIK Go’s scheduling has been preemptive for a few versions now. You could even say it’s more preemptible than BEAM, because a BEAM process that’s inside a NIF can’t be preempted (and IIRC NIFs have to perform their own time accounting).
threading goes it’s not much different than, say, a JVM or Go, right?
For the underlying implementation, no; for the interface, it couldn’t be more different, as Erlang processes have individual heaps and don’t share memory in any meaningful way (there are one or two types which are refcounted on a per-runtime heap, but that’s the exception rather than the rule).
What, in your opinion, would be a correct solution? In practice, just abstract over semaphores and let the runtime of a language deal with concurrency and parallelism? What does Erlang do differently and how?
That is a possibility, yes. It would mean less fine-grained control, but such control is the very cause of the problems we are talking about.
Erlang has self-contained green processes with zero shared state. The problems discussed in this whole thread simply cannot occur in Erlang, by nature of its VM concurrency model design.
You share state by encapsulating it in a process and using message passing to access it.
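A rough analogue of that pattern, using plain OS threads and channels rather than BEAM processes (the Msg type and names below are just for illustration):

```rust
use std::sync::mpsc;
use std::thread;

// Messages the owner of the state understands.
enum Msg {
    Add(i64),
    Get(mpsc::Sender<i64>), // carries a reply channel
}

fn main() {
    let (tx, rx) = mpsc::channel::<Msg>();

    // The only place the counter lives; nothing else can touch it directly.
    let owner = thread::spawn(move || {
        let mut counter = 0i64;
        for msg in rx {
            match msg {
                Msg::Add(n) => counter += n,
                Msg::Get(reply) => { let _ = reply.send(counter); }
            }
        }
    });

    // Clients "share" the state only through messages.
    for n in 1..=10 {
        tx.send(Msg::Add(n)).unwrap();
    }
    let (reply_tx, reply_rx) = mpsc::channel();
    tx.send(Msg::Get(reply_tx)).unwrap();
    println!("counter = {}", reply_rx.recv().unwrap()); // prints 55

    drop(tx); // closing the channel lets the owner thread exit
    owner.join().unwrap();
}
```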
Erlang has self-contained green processes with zero shared state. The problems discussed in this whole thread simply cannot occur in Erlang, by nature of its VM concurrency model design.
That’s not totally true. Large, shared objects are ref counted and live in a shared heap. Reference:
Binary terms which are larger than 64 bytes are not stored in process private heap. They are called Refc Binary (Reference Counted Binary) and are stored in a large Shared Heap which is accessible by all processes who have the pointer of that Refc Binaries. That pointer is called ProcBin and is stored in process private heap.
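They’re probably still immutable and copy-on-write, so it doesn’t really matter. It’s also a clever idea.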
Creating a thread is expensive because it requires a system call, which is because threads are managed by the kernel — otherwise issuing a blocking system call wouldn’t let other threads keep running.
What if we made it so that they don’t? Or we amortized the cost? We expose other things to userland through fast paths, and we have io_uring for async system calls that can be chained together. Or what if we allow pushing programs into the kernel à la eBPF, so that we can reduce “create a thread, open a file, read the file into a buffer, time out if it takes too long” into a single system call?
When you’re doing async you’re already performing system calls, generally, so I think it’s reasonable to solve “performing too many” with “allow combining them to perform fewer”, which eBPF + io_uring is pushing us towards.
It certainly feels like there should be some live options here that are more performant anyways.
Threads require dedicated stack space, which isn’t growable, so a healthy chunk of address space has to be reserved for each one.
I’m not convinced that this is a real problem. The stack is allocated as far as the kernel is concerned but it’s not mapped in until it’s accessed.
But the kernel is what schedules threads; that’s one of its core features, even microkernels do it. If you create threads the kernel doesn’t manage, you have no choice but to do your own scheduling … I.e. green threads, as I said. That means that when you go into a system call, nothing happens until you return, all your threads are stuck.
I think you’re overlooking what Solaris offered with its M:N-style threads, covered in the Solaris Internals book, for example, which was current while you were still at Apple (and I was at Sun). As a TL;DR: thread scheduling can be multi-tiered, with some kernel threads scheduled by the kernel, on which userspace threads are multiplexed and scheduled by userspace library functionality.
We ultimately ditched it though, having taken it about as far as it was useful to take it. LWPs (the user mode primitive) are now once more 1:1 with kernel threads. M:N threading is ultimately complex and has structural issues in scheduling that you can’t really solve.
Thanks for posting your two comments! I was hoping that by posting this people with deep experience might come out of the woodwork and dissect it a bit.
I think there was a retrospective paper (or blog post or usenet article?) on the problems with Solaris threads. The impression I got was that the kernel interface was still substantially POSIX, which doesn’t provide enough control over blocking (eg, file system ops vs networking) or pre-emption (eg, signals) for userland threads to work straightforwardly well. More recent async runtimes (libuv, golang) mitigate the problems by using worker threads for blocking ops and by having a more complicated runtime that makes bigger changes to the way userland code interfaces with the kernel than Solaris threads did.
It’s true that the Solaris M:N threading model underpinned full POSIX threads. The threading libraries were part of the operating system, though, co-developed with the kernel portions. Though it is a long time since I’ve looked into the particulars, I believe there were mechanisms to deal with signals and blocking system calls, along with the other complexities of the UNIX process model.
I’m not sure that libuv does anything particularly special with threading. It’s mostly a cross-platform implementation of an event loop that attempts to paper over the differences between event ports, kqueue, epoll, \Device\Afd, etc. It’s otherwise generally just callback-based polling I/O and timers and so on. If you do long-running work in a libuv event handler, I’m pretty sure you’ll hold up other tasks just as much as you would in Javascript; it provides a thread pool abstraction that you’re supposed to push heavy work into, but that doesn’t have access to the event loop. I don’t think libuv does anything structural to mitigate those issues.
The Go runtime has definitely been through several different shifts in its life time, and I believe they are honing in on the kind of M:N threading solution that Solaris ultimately arrived at. In particular, I gather they now do some amount of work stealing, and they try to kick long-running tasks off of other OS threads by sending thread directed signals to those threads to kick them out. Though they have presumably attempted to reduce the amount of state preservation required to switch between goroutines, to be able to asynchronously punt them out mid-compute I have to imagine they’re still preserving quite a lot! They also have their own complexities like garbage collection that POSIX didn’t have to grapple with. It’s still challenging to do as good a job as the kernel is able to do at certain aspects of scheduling and preemption in a 1:1 model, but I agree that abandoning much of the existing ABI and process model has allowed them to try some new ideas!
Vexing that I can’t find the old Solaris 8/9 threading retrospective. I did find a note about problems with “LWP starvation”, where there were runnable userland threads but all the LWPs were blocked in the kernel. Which agrees with my memory of the retrospective paper, that there were a lot of blocking syscalls that the threading library was unable to make nonblocking, eg the kernel APIs were still too POSIXy to help, and userland didn’t have a sufficiently adaptive thread pool to offload blocking calls.
Basically, things like open() might vary in speed a lot, such that much of the time you don’t want to offload it, but sometimes you really need to. You don’t know which it is in advance, so either you pessimize the fast case or risk starvation in the slow case. There’s no facility for open() to return quickly with a handle on a slow pending op. When Solaris m:n threading was introduced with talk about userland/kernel co-development, I thought they had tackled issues like this, but apparently that didn’t happen.
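https://a.co/d/fF01pMT
I think it’s still in there, even though ostensibly that’s Solaris 10. I don’t have it anymore so can’t check.
Not a kernel engineer either, but how does it stop the kernel from scheduling? Why would I have to do scheduling in userland?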
It doesn’t stop the kernel, it stops your user land code.
If you have 1000 greenlets running in one OS thread, and any single one of those greenlets makes a syscall, your whole app stops while that one greenlet waits for the syscall.
You’re not stuck though. You submit a program to the kernel and the kernel yields back to you until the program has completed. Even if you were stuck, why does that matter? The premise is that you can have a million threads already.
I think the confusion here is “if you are creating your own threads” but I’m not suggesting green threads.
I’m confused. The premise here, from the article, is that kernel-based threads are expensive, not that they’re free and you can have millions. If you have millions, why does it matter if they block, as you say? Having system calls be async would just complicate the userspace programming model for no benefit.
Now imagine a parallel universe where instead of focusing on making asynchronous IO work, we focused on improving the performance of OS threads such that one can easily use hundreds of thousands of OS threads without negatively impacting performance
should we have instead invested that into improving the performance of OS threads? I think so
I feel like perhaps we’re talking past each other. I am saying that the premise is “we should try to live in a world where threads are free/ such a world is a better world”. You then said “here is why they are expensive” and I am saying “those are not fundamental features of OS threads, just of our existing APIs”. I’m not advocating for green threads - like the author, I’m suggesting that we could make OS threads faster.
Pushing little programs into kernel mode is a cool technique that I think we’re just learning how to use. Reminiscent of the NeWS window system, and originating from basically the same latency argument (now kernel mode switch vs. network). Presumably over the next decade or so it’ll settle down (or fall out of favor).
As snej said, the stack allocations are a lesser problem with 64-bit (actually 48-bit in current CPUs) address space, but still there’s page table overhead (space and time). If you have a million threads, you’ll notice, whereas a million pending I/O completions are not a problem.
System calls can be surprisingly fast. From what I remember, it’s something like 50 nanoseconds to enter and exit the kernel. Maybe a bit more in today’s post-Spectre world, but still on the order of a few tens of atomic operations.
Dedicated stack space might have been a problem in a 32 bit world, but today’s 64 bit machines have a lot of address space to play with.
20 years ago, threads were very expensive, especially on Linux. And then NPTL came along and replaced the poor implementation. But, I guess, the scars live on.
imo one of the interesting questions to ask is not about the differences between virtual/green and “OS” threads, but between threads and processes. At least on Linux, the syscall for spawning a thread (clone3) is essentially a fancier fork. It just has some options for sharing the same virtual memory, thread group, pid, parent pid, etc. (among others, like namespaces; I don’t know if it’s possible to use clone3 to spawn a thread in the same address space but a different pid/network namespace, for example).
and with the fact that P and E cores are now on millions of consumer/developer devices with Apple’s chips, and some of the fancier threading knobs like audio workgroups on macOS, the obvious gains of virtual threads with M:N threading instead of deferring to the kernel are getting hazier.
I don’t know a lot of the answers to these questions but ime, creating threads is a lot cheaper than people think, while using green threads is not as much of a win as people think, and part of the trouble of contemporary async runtimes is that benchmarking requires changing architecture, which makes measuring differences hard.
Technically, I can still remember one situation in the modern era in which we halted the CPU to issue commands to a peripheral: the 1.44MB floppy drive. Even in Windows NT, the machine would completely lock up every time it hit the floppy disk, IIRC. So technically, I’m kind of wrong? 🤣
Oh, and (IIRC) the reason for that is because the 1.44MB floppy drive didn’t have its own controller (i.e. processor), so the CPU had to control it directly, and timing was important. So: Sorry about the database not responding for a few seconds, and sorry if the clustering software isn’t responding for a few seconds, but I really need to read something off a floppy disk 🤣
Back then, with fragile buses and single core, single CPUs, we had to CLI (clear all interrupts) before doing certain kinds of IO, and so the mouse pointer would freeze (it was handled by an IRQ, i.e. mouse movement would suspend everything else on the CPU and run an interrupt handler which would “unpaint” the old cursor location and paint the new cursor location), etc. The CLI was necessary, because we didn’t have any other dedicated processors handling timing, etc. for most IO operations. I don’t work at that level at all anymore, but the number of dedicated IO processors in a computer (not counting the CPU and GPU) must be quite high now, and the PCIe bus allows devices on the bus to write directly into the buffers (sometimes measured in MBs) without involving what programmers think of as “the CPU” (i.e. the IO to/from RAM buffers is routed completely in hardware, via the “north bridge”, bypassing the OS and “the CPU” altogether).
And it’s those buffers that allow for such good async IO behavior. Input just means “hey, there’s something in the buffer for me to copy out”, and output just means “hey, there’s space in the buffer that I can write to”. All the complex stuff is about how efficiently that can be determined, and what the OS does to support efficient notifications thereof. (But I’m way out of date on this … the last project I worked on at that level was about a dozen years ago, trying to do zero copy optimizations, RDMA, and remote “agent” invocation with 40Gbps IB on dedicated networking cards that had their own Intel embedded Xeons and RAM on the cards themselves for doing the work. Talk about complex.)
Just like we use green threads but then also have synchronous APIs where we bind together all the asynchronous concerns into a synchronous API. It’s flipped from async to sync all the way up the stack.
I think this is what you’re saying, which I agree with: The green thread should be the only one doing blocking IO APIs, because blocking APIs are simpler to reason about and use, and thus the use of blocking APIs produces more reliable code. But if at all possible, below the green threads we’d want async IO APIs the whole way down.
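A small sketch of that “sync on top, async underneath” shape in Rust, assuming the tokio crate (features “rt” and “time”) purely for an executor and a timer; fetch_config is a stand-in for real async I/O:

```rust
use std::time::Duration;

// Stand-in for some real async I/O operation.
async fn fetch_config() -> String {
    tokio::time::sleep(Duration::from_millis(10)).await;
    "max_conns = 100".to_string()
}

// The synchronous facade: a blocking call on top, async underneath.
fn load_config_blocking() -> String {
    tokio::runtime::Builder::new_current_thread()
        .enable_time()
        .build()
        .expect("failed to build runtime")
        .block_on(fetch_config())
}

fn main() {
    // Callers never see the async layer at all.
    println!("{}", load_config_blocking());
}
```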
The author of the article was arguing on Reddit that io_uring “doesn’t count” because, according to them, every file IO operation is implemented as a blocking call executed in io_uring’s worker pool.
Putting aside moving goal posts and bad faith arguing, I was really surprised by this - I had assumed it was “async all the way down”.
I can’t find anything entirely authoritative on this though, but maybe there are complexities making it “async all the way to the driver” because of the page cache? Perhaps the gains are already mostly had by removing the syscall cost?
The traditional way of writing things like Unix filesystems is to write straight-line code and use what is effectively synchronous block IO to read things off disk like directories or file indirect blocks. It’s effectively impossible to make filesystem code like this into asynchronous code; it would require major rewrites of core parts of filesystems. So you make it ‘asynchronous’ by putting it into its own thread. I think the only way you get an operating system that can do truly asynchronous filesystem IO is to have that as a design goal from the start and refuse to allow anyone to add kernel code with synchronous IO waits, even though this will make filesystems harder to write.
(It’s one of the ironies of life that low level Unix block IO operations are normally asynchronous, but all of the filesystem code used to read things does basically ‘submit IO request; go to sleep until IO request completes’, generally wrapped up in a function or a macro. This isn’t just Linux; it’s a pattern all over multiple Unix kernels, including historical ones.)
There’s an alternate universe where kqueue won the async API “wars” and Linux ported kqueue instead of inventing epoll, and then inventing io_uring because epoll wasn’t good enough.
In this fantasy universe, Darwin, the BSDs, and Linux all support the same async API. POSIX standardizes kqueue. Windows provides a compatibility API. World peace is achieved and all is well.
There are a couple of things that io_uring does that are better than kqueue + AIO (also worth noting that, although Linux implements the POSIX AIO APIs, the implementation is very slow and so people avoid it, whereas on XNU it’s good and on FreeBSD it’s the thing Netflix uses to saturate 200 Gb/s links with TLS):
The first is that io_uring learns lessons from paravirtual devices and provides a mechanism that lets the kernel pull in a batch of things when it has an idle thread, whereas kqueue and lio_listio require a batch to be pushed in at the same time. I’m somewhat uncertain how much this matters in practice. My intuition is that there are some outliers where it makes sense but a lot where it simply doesn’t matter. The amount of work you typically do in the kernel for an I/O op means that the syscall overhead is negligible. The main win is probably a reduction in cognitive load for the programmer. You can just start throwing things at the kernel and let something else figure out whether it needs to be kicked.
The second, and I think this is a major one, is providing isolated namespaces for file descriptors. POSIX has a requirement that new file descriptors must be the lowest unused FD number. This requirement exists because dup2 didn’t exist in early UNIX and this was how the shell did file descriptor redirection. As a side effect, it means that every close must synchronise with any operation that creates a new file descriptor, which causes scalability issues for a lot of workloads that do things like accept-read-write-close in a tight loop (web servers and other things with request-response semantics) on multicore systems. There’s a nice paper from around 2000 showing that having per-thread fd numbers can speed things up.
I can imagine adding a facility somewhat like this without io_uring where you take the highest bit (below the sign bit, fd numbers have to be signed) as a discriminator and split the FDs into pool + ID pairs, where processes can choose how many bits are used per pool and can explicitly request with variants of socket, openat, accept, and so on that the new FD is in a specific pool (and not necessarily the lowest one). That is quite a big and invasive change. In io_uring there’s a simple change to decide whether new FDs are added to the FD table or to something local to the ring.
I think io_uring (fixing the Linux mistake of writing uint64_t when they mean void*) would be a good addition to other operating systems.
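To make the pool + ID split concrete, here is a toy encoding in Rust; the bit layout and field widths are made up for illustration and are not anything io_uring or any kernel actually does:

```rust
// Hypothetical layout: bit 30 (just below the sign bit of a 32-bit fd)
// marks a pooled fd, the next 8 bits pick the pool, the rest is an index.
const POOLED: i32 = 1 << 30;
const POOL_BITS: u32 = 8;
const INDEX_BITS: u32 = 30 - POOL_BITS;
const INDEX_MASK: i32 = (1 << INDEX_BITS) - 1;

fn pack(pool: i32, index: i32) -> i32 {
    POOLED | (pool << INDEX_BITS) | (index & INDEX_MASK)
}

fn unpack(fd: i32) -> Option<(i32, i32)> {
    if fd & POOLED != 0 {
        Some(((fd & !POOLED) >> INDEX_BITS, fd & INDEX_MASK))
    } else {
        None // an ordinary, globally numbered fd
    }
}

fn main() {
    let fd = pack(3, 42);
    assert!(fd > 0); // still a valid-looking, non-negative fd
    assert_eq!(unpack(fd), Some((3, 42)));
    assert_eq!(unpack(5), None); // plain fds pass through untouched
    println!("pool 3, index 42 packs to fd {fd}");
}
```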
There’s an alternate universe where the *NIXes copied IOCP from Windows and thus had good completion based IO. Alas, they went with readiness notification instead for some reason.
Same reason that they didn’t used to have poll(2), only select(2). It’s 80s BSD / Mach with updates as needed to make the shiny products shine better.
To put it another way, Darwin is developed by a company that makes money selling new hardware. io_uring is made by a company (Meta) that makes money not buying new hardware. They both have incentives to improve OS performance but they’re subtly different.
I would say that there will always be at least one operation that cannot be performed asynchronously. It might be a very specific one which is only synchronous in very specific situations with very specific settings, but it is basically guaranteed there is at least one such case.
I’m not sure I have the same intuition, seems that making everything async-able has been a goal of the kernel for awhile, and they’re close, if not there already.
It’s been a couple years since I was that far down the stack, has that initiative stopped short of completion? Is it expected to?
I can’t tell you specifically for the Linux kernel. My point is that there will always be something that can block because of how many syscalls and syscall options are out there.
The utility of async programming is to let people express complex state machines that are typical in I/O bound code. The performance benefits are, quite frankly, incidental.
Not only would this offer an easier mental model for developers
Would it? I think the complexity would be pushed elsewhere. With asynchronous I/O, you get to make local concurrency reflect local constraints on resources, and receive events for I/O completions etc; with threads, you have to make local concurrency reflect I/O, and deal with locking etc for handling constraints on concurrent access to local resources.
Linux didn’t have good support for threads until the 2.6 release in December 2003.
Fun fact, the first version of Linux threading was implemented by Xavier Leroy, the creator of OCaml. This dude is legendary: https://en.wikipedia.org/wiki/LinuxThreads
You’re at customs and it’s looking bad. They’ve got your Thinkpad and — forgetful baka you are — you forgot to power it off before the flight, meaning the FDE keys were still in RAM.
The officer turns the laptop to face you. Your stomach drops. They’ve unlocked the session, and Rhythmbox is open.
“Kindly explain. Here we found exceeding three years’ playtime, of what exactly?” She pauses to sneer. “Independently produced French pop rock!”
The officer scrolls down the list. There’s almost no cover art, because most of it never got cover art. You always knew your meticulous ID3 tagging would come back to bite you.
She gestures to the screen while nodding at her supervisor. “This one not even Discogs has a match for.” The atmosphere is so frigid you can hear a pin drop, and silently you wish they hadn’t installed pin dropping machines in customs offices after the war.
“Enough. Tell me your excuse.”
“Well. How do I.. I-I was just really into OCaml in the early 2010s, okay?!”
The supervisor’s face lights up, and suddenly you can hardly believe your luck. It’s Xavier Leroy!;;
I think even if there were no other advantages (such as potential greater performance or lower overhead) from asynchronous I/O, explicit forms of it would still be preferable to traditional threading simply because it’s easier to reason about.
This classic post goes into some more detail, but the gist is: traditional threading is simply too difficult to reason about, largely because it allows execution to be suspended/resumed effectively anywhere, at any time, which creates a combinatorial explosion of “what if that happens between these two lines of code” cases you have to account for. Explicit async, on the other hand, explicitly marks the places where code might suspend and resume, which vastly reduces the number of “what if that happens between these lines” cases you need to worry about and take care of. For example, in an async/await approach, await expressions tell you where a piece of code might suspend/resume execution.
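A tiny Rust illustration of that point, assuming the tokio crate (features = ["full"]) just for an executor and timers; within this task, the .await expressions are the only places the two pieces of work can interleave:

```rust
use std::time::Duration;
use tokio::time::sleep;

async fn fetch_len(label: &str, delay_ms: u64) -> usize {
    // This function can only be suspended at an .await; between awaits it
    // runs straight through without anything else in this task interleaving.
    sleep(Duration::from_millis(delay_ms)).await;
    label.len()
}

#[tokio::main]
async fn main() {
    // Both "requests" are in flight concurrently, but the only points where
    // control can switch between them are the explicit .await expressions.
    let (a, b) = tokio::join!(fetch_len("first", 50), fetch_len("second", 10));
    println!("lengths: {a} and {b}");
}
```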
traditional threading is simply too difficult to reason about, largely because it allows execution to be suspended/resumed effectively anywhere, at any time, which creates a combinatorial explosion of “what if that happens between these two lines of code” cases you have to account for
That’s not “traditional threading”; the APIs are the same for green threads, with added tricky details to keep in mind. That’s perhaps ‘threading’ in the way it is exposed in major operating systems’ APIs.
What you mention as a problem is not a problem of threads, but rather of accessing and mutating shared resources as if they were not in a multithreaded space. You should not do that. Because if you do, of course that problem arises. Protect your shared resources with a lock/semaphore.
If a language or runtime forces you to do this, then the problem you state cannot possibly occur.
Erlang and Elixir do something along these lines: they force you to explicitly pass ‘shared data’ around as state.
Erlang gets around this by not sharing data between processes, which requires a whole honking lot of memcpy, which is expensive. Pony has a smarter solution, but Pony never really caught on.
What you mention as a problem is not a problem of threads, but rather of accessing and mutating shared resources as if they were not in a multithreaded space. You should not do that. Because if you do, of course that problem arises. Protect your shared resources with a lock/semaphore.
I think if you read the longer article I linked, you will find these points addressed.
The article makes the assumption that classic threading means “let’s mutate everything without any synchronization”. It states that in the introduction. That is a horrible assumption. Of course you shouldn’t do that. Protect your shared resources with locks! That is the intended way to use it.
I reckon one can argue “why is it possible to do so, then”. I guess that’s a valid point. Older languages were not designed with idiot-proofing in mind. Arguably, if something is strongly discouraged, maybe there might be little to no reason to even support it.
This occurred to me the other day: I started working on a Node project at work and had to spawn several async tasks to speed up some IO-heavy code, and the fact that JS only ever runs on a single thread makes async much easier to reason about than, say, Rust, where multiple coroutines might be running in parallel over several threads. There are no synchronization primitives in Node’s standard library because you just don’t need them!
Then hopefully you have a system that doesn’t share mutable data between threads. Something like Rust’s “Send” trait, or Erlang processes. This lends itself to the Actor model, where individual objects are single-threaded but run on separate threads.
That does not stop sharing mutable data between threads. It just prevents data races. You can share data with mutexes. Though I think actor-model-esque stuff is popular in Rust async; sharing by having an actor own something and sending messages to it is a pretty good model.
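For example, in Rust the mutex version of sharing looks like the sketch below; it’s the Send/Sync checks that rule out the racy alternatives at compile time:

```rust
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    // Shared mutable data is allowed, but only behind a Sync wrapper;
    // the Send and Sync traits let the compiler enforce this.
    let hits = Arc::new(Mutex::new(0u64));

    let handles: Vec<_> = (0..4)
        .map(|_| {
            let hits = Arc::clone(&hits);
            thread::spawn(move || {
                for _ in 0..1_000 {
                    *hits.lock().unwrap() += 1; // no data race, but still shared
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }

    println!("{}", *hits.lock().unwrap()); // always 4000

    // By contrast, something like Rc<RefCell<u64>> is not Send, so moving it
    // into thread::spawn would be rejected at compile time.
}
```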
I think this article is flawed because it presumes that the amount of effort that has been put into async would have yielded appreciably better-performing threads, but it does not cite any evidence for that. Quite to the contrary, people started moving to async explicitly because tons of research into how to make threads cheaper went nowhere.
The author claims that back when C10K was becoming popular threading was immature (and arguably on Linux it was), but there had already been many efforts to make threading cheaper in the industry, most of which had been abandoned because they broke down or were nightmares to maintain. And it is not like people have stopped trying, there continue to be many ideas and experiments. Off the top of my head:
M:N threads
M:1 threads
Scheduler activations
Green threads (which actually have some success in VM languages)
Go split stacks
All of those were attempts to make threads cheaper, either automatically or through explicit programmer control. They all (mostly) failed due to complexity or compatibility with existing code, and I don’t see any evidence we were getting any closer to doing better.
I also think the article does itself a disservice by evaluating threading performance on Linux. macOS and Windows have taken different approaches that involve de-emphasizing the use of threads (things like GCD and Swift structured concurrency). The Linux kernel folks don’t really have as much ability to control what people do in user space as platform vendors do, so instead they did what they always do: focus on optimizing the kernel primitives as much as possible. Not surprisingly, they have some of the best threading micro-benchmarks out there, but that actually works against the author, since it means he is comparing a best-case scenario for modern threading, determining it is too slow anyway, and then lamenting that if people had spent more effort getting it even faster (without any evidence as to why that would work) we might not need async I/O.
The situation with Linux makes me doubtful that simply optimizing a modern 1:1 kernel-scheduled threading implementation will close the gap necessary to replace async, and I see plenty of evidence that the existing software ecosystem cannot handle a more fundamental change outside VMs or more isolated environments.
what if instead of spending 20 years developing various approaches to dealing with asynchronous IO (e.g. async/await), we had instead spent that time making OS threads more efficient
I think this misses the bigger picture: OS threads are slow because you don’t control the scheduler. Its having to make decisions for tasks outside your program slows it down. Its having to execute adversarial programs slows it down. Concurrent systems generally get a speedup when they take control of their task scheduler.
It’s not anything to do with the scheduler, per se; it’s almost universally the cost of virtualising the increasingly large and complex amount of CPU state so that you can pretend you’re the only one on the CPU. It’s the cost of saving all of your registers, and your floating point registers, and now a bunch of extra AVX crud, and then dealing with the litany of mitigations and other cache detuning steps required to get you off the CPU. Then we have to do everything in reverse to get the next thread onto the CPU. This is partly why scheduling quanta are ideally measured into at least tens of milliseconds, in order to better amortise the cost of dispatch.
To the extent user mode scheduling offers different latency characteristics and performance, it’s because you’re giving some of this up. You’re giving up the security boundaries between processes, for example. You’re likely adjusting your program’s idea of its internal ABI such that you merely don’t have to save all of the CPU state, and can instead focus on an easier subset. Async/await systems with stackless switch points actually take this the furthest, in that they often need to save almost none of the register state, they just need to record where they got up to in the function and potentially some local variables to come back to on resume.
The cost of this, apart from security boundaries, is generally that you’re hiding everything from the kernel, which is best placed to do time-based preemption of long-running tasks. This is why people ultimately begin to reach for work-stealing M:N systems, to prevent silos of work being blocked up behind long-running tasks, with some set of kernel-visible threads sitting idle. It’s difficult and complex because the only way you generally have to induce a time-based preemption is thread-directed signals, with all of the issues those entail!
CPU. It’s the cost of saving all of your registers, and your floating point registers, and now a bunch of extra AVX crud, and then dealing with the litany of mitigations and other cache detuning steps required to get you off the CPU.
It feels like there’s a bit more to this story. This benchmark:
Shows that, if you pin the threads to one core, the context switch cost is the same:
In these runs, I’m seeing 18.19s / 26.91s ≅ 0.68 or a 30% speedup from going async. However, if I pin the threaded version to a single core, the speed advantage of async disappears:
My best explanation is that this benchmark models a situation where the reason for a context switch is IO, so every switch is amortized by a syscall, and so making the switch itself cheaper doesn’t really improve performance.
In other words, the switch improvement only matters for direct, IO-less switches, where, e.g., you send something between tasks using an in-memory channel, rather than terminating each task’s (virtual) quantum with a syscall.
I don’t understand how frequently you need such “direct” switches in real applications. On the one hand, I can see a picture where each task lives between read and write anyway, so it doesn’t really matter. On the other hand, perhaps under high load you can wake up a bunch of tasks at the same time, (epoll wait returns a bunch of things at the same time).
In general, I have yet to see a conclusive benchmark that carefully demonstrates all the relevant effects here. Everything I’ve seen so far goes through several steps of argument/counterargument like
Well, there always is. Modern CPUs are extremely complicated; by virtue of out of order dispatch, register renaming, deep speculative pipelines, complex and increasingly vast cache hierarchies, it is frequently difficult to pin down what the machine is even doing a lot of the time. Stack-based profiling requires that the CPU unwind enough speculation and commit to certain issued instructions so that the timer interrupt handler that fires to measure the stack can observe only things that are definitely true, and not things that might not come to pass. What impact does that have on execution? Who can say!
The only way to know is to take your actual workload and try it out. Synthetic benchmarks are a curiosity and a tool, to be sure, but the proof of the pudding is in the eating, as they say.
if I pin the threaded version to a single core, the speed advantage of async disappears
This is not really surprising to me. How many threads does it create? If they’re contending for a single core, even though they are within the same process, there’s a lot of activity the kernel engages in to spill the user CPU state into privileged stack storage and then restore the state of the target user context. I would venture that most modern OS kernels generally try to avoid internally using, for example, the x86 floating point and extended register state; by being very careful, the kernel can sometimes avoid the need to save and restore the FPU and other wider registers for a user context when merely servicing a system call. If you need to punt between threads, though, all bets are off and you must do the work.
If an async process has around number-of-CPUs threads doing a mixture of polling and actual handler work on an otherwise idle system, it would be conceivable to me that they end up mostly not migrating between CPU cores during execution. They may do a lot of system calls – indeed, more, probably, than the straight line synchronous blocking code, because you also have to poll – but those are just user-kernel-and-back transitions, and are, I believe, cheaper than user-kernel-different-user transitions.
To be clear, I wasn’t actually advocating a global position on threads or async in my post anyway, just trying to talk about some of the trade-offs. I think Rust does a good job with async/await of allowing a user to write ostensibly straight line code (with some caveats) which gets turned into the kind of state machine I would have had to write by hand to use event ports or kqueue or epoll in C. I think Rust also makes it remarkably easy to use actual OS threads when the moment calls for it, what with the Send and Sync markers and a good set of base synchronisation primitives.
I do think that naive libc-level full stack M:N threading, where the only thing that really moves to user mode is the work of dispatching between user threads, is not worth the trouble. The goal of user mode task systems (like what Tokio in Rust provides, or Go, or Erlang, or Node and Javascript) should be at least somewhat substantially focused on ergonomics as much as performance or memory usage or anything else. If you can find a good balance of ergonomics and resource consumption – and, mostly unmentioned here, you can figure out how to debug that software! – then you’re probably onto a good thing.
In the async version, we have these same 500 pipes, but they are managed by 500 async tasks, not 500 pthreads
Yes, but how many OS threads is the executor using to underpin those async tasks? I have observed, for example, that the multi-threaded Tokio runtime creates at least one OS thread for each apparent schedulable entity (which might be one of two or more vertical threads on a hyperthreading CPU) and async tasks can then bounce between them.
If your runtime creates 48 threads, and you then pin them all to the one CPU, you’re presumably going to be punting between OS threads at least every 20 msec if not more frequently based on your pattern of system calls and blocking and so on?
Then I expect the next step will be to figure out what the benchmark is actually measuring! As a contrived, synthetic benchmark, it will be important to examine both the program itself and the underlying OS at a fine-grained level to see what’s actually going on. Merely looking at the aggregate rates obviously isn’t enough to do anything but speculate. I would get started by using DTrace, but I imagine there’s something similar you can do with whatever OS you’re running the benchmark on.
But for now, I remain extremely skeptical of the claims of the form “user-land threads are faster because…” or “kernel-level threads are faster because…”
There are a lot of such claims, and they all have rather elaborate explanations of what goes on, but they just never connect to explainable benchmarks.
It’s not anything to do with the scheduler, per se; it’s almost universally the cost of virtualising the increasingly large and complex amount of CPU state
Saving all the relevant general-purpose registers (and the fp/control/segment/mask registers, which userspace stackful coroutines wouldn’t have to save) takes a few hundred nanos at most.
other cache detuning steps required to get you off the CPU
The x86 SMP memory model is that all cores can address memory. Internal details like cache coherency, the interconnect, etc. make this work somewhat efficiently. The kernel optimizing task scheduling for cache access is a decision it makes from a global view and isn’t strictly necessary:
One can write a userspace multi-threaded work-stealing scheduler where each thread is pinned to its own core and observe not much perf-loss compared to not-pinning & the kernel occasionally keeping worker threads on the same core.
people ultimately begin to reach for work stealing M:N systems, to prevent silos of work being blocked up behind long running tasks
I describe this as controlling latency by controlling the scheduler. “Time limits” are the most opaque method to achieve it (but require preemption as you’ve noted). Other systems achieve this via prioritization and other bounding rules - which is often more efficient given the scheduler knows more about the task’s needs (unlike a kernel <-> process/thread relationship).
I’ve come to like async being the only available method for IO, like how Roc does it. The biggest problems usually come from all kinds of conflict between async and sync IO, and if only one of them is available, everything is much nicer.
Reading this, I’ve been wondering if the author is aware of lightweight/green threads. Moreover, there are implementations which automatically execute operations that will block in OS threads (they’re not well documented but the list doesn’t change often at all).
I’m not that smart, but I have a really hard time reasoning about data safety and deadlocks in multi-threaded systems. Smarter people than me also seem to have a hard time with this.
One of my hot takes is that concurrency safety is a more important & revolutionary property of Rust than memory safety.
Another hot take is that we probably want to architect our systems to have one single-threaded process per core, each using async-io, with some message passing system between them, ideally with capability passing.
I was expecting the title mistake to be about async/await and associated function coloring (not something I think is obviously a mistake, but I’m open to other opinions–say, maybe async/await has been over-applied)–but instead the article appears to be based on a false dichotomy of optimization effort going into either native thread overhead or async infrastructure in programming languages:
Now imagine a parallel universe where instead of focusing on making asynchronous IO work, we focused on improving the performance of OS threads such that one can easily use hundreds of thousands of OS threads without negatively impacting performance
Perhaps we can do better than spawning lots of processes and context switching between them, regardless of whether it’s in the language runtime or the operating system?
People like Jim Gray advocated using pipeline parallelism instead (one process/thread per core) and partitioning/sharding slow stages in the pipeline (scaling a stage by assigning more cores to it). This is also what the LMAX disruptor does, and what databases do (for a more recent example, see Umbra’s morsels).
Given the success of pipelining in manufacturing and hardware, I wonder why nobody seems to be following Jim’s advice and trying to make it easier to do in software (perhaps via new programming languages)?
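A toy sketch of that shape in Rust, with one OS thread per stage and bounded channels between them (pinning each stage to its own core is left out, since std has no portable way to do it):

```rust
use std::sync::mpsc;
use std::thread;

// A three-stage pipeline: produce -> parse -> aggregate, each stage on its
// own OS thread, handing work to the next stage over a bounded channel.
fn main() {
    let (to_parse, parse_rx) = mpsc::sync_channel::<String>(1024);
    let (to_sum, sum_rx) = mpsc::sync_channel::<u64>(1024);

    // Stage 2: "parse" lines into numbers.
    let parser = thread::spawn(move || {
        for line in parse_rx {
            if let Ok(n) = line.trim().parse::<u64>() {
                if to_sum.send(n).is_err() {
                    break;
                }
            }
        }
    });

    // Stage 3: aggregate.
    let summer = thread::spawn(move || sum_rx.iter().sum::<u64>());

    // Stage 1: produce input.
    for i in 1..=100 {
        to_parse.send(i.to_string()).unwrap();
    }
    drop(to_parse); // close the pipeline from the front

    parser.join().unwrap();
    let total: u64 = summer.join().unwrap();
    println!("total = {total}"); // prints 5050
}
```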
I suppose, but perhaps we need something without the clunkiness of being a library and the overhead which causes it to be slower than using a single thread when run locally?
What if we made it so that they don’t? Or we amortized the cost? We expose other things to userland through fast paths, and we have io-uring for async system calls that can be chained together. Or what if we allow pushing programs to the kernel ala ebpf, that way we can reduce, “create a thread, open a file, read the file into a buffer, timeout if it takes too long” into a single system call?
When you’re doing async you’re already performing system calls, generally, so I think it’s reasonable to solve “performing too many” with “allow combining them to perform fewer”, which ebpf + iouring is pushing us towards.
It certainly feels like there should be some live options here that are more performant anyways.
I’m not convinced that this is a real problem. The stack is allocated as far as the kernel is concerned but it’s not mapped in until it’s accessed.
But the kernel is what schedules threads; that’s one of its core features, even microkernels do it. If you create threads the kernel doesn’t manage, you have no choice but to do your own scheduling … I.e. green threads, as I said. That means that when you go into a system call, nothing happens until you return, all your threads are stuck.
I think you’re overlooking what Solaris offered in its M:N style threads, covered in the Solaris Internals book, for example, current while you were still at Apple (and I was at Sun). Thread scheduling can be multi tiered, some kernel threads scheduled by the kernel, on which userspace threads are multiplexed and scheduled in userspace library functionality, for a tldr.
We ultimately ditched it though, having taken it about as far as it was useful to take it. LWPs (the user mode primitive) are now once more 1:1 with kernel threads. M:N threading is ultimately complex and has structural issues in scheduling that you can’t really solve.
Thanks for posting your two comments! I was hoping that by posting this people with deep experience might come out of the woodwork and dissect it a bit.
I think there was a retrospective paper (or blog post or usenet article?) on the problems with Solaris threads. The impression I got was that the kernel interface was still substantially POSIX, which doesn’t provide enough control over blocking (eg, file system ops vs networking) or pre-emption (eg, signals) for userland threads to work straightforwardly well. More recent async runtimes (libuv, golang) mitigate the problems with worker threads for blocking ops and having a more complicated runtime that makes bigger changes to the way userland code interfaces with the kernel than Solaris threads did.
It’s true that the Solaris M:N threading model underpinned full POSIX threads. The threading libraries were part of the operating system, though, co-developed with the kernel portions. Though it is a long time since I’ve looked into the particulars, I believe there were mechanisms to deal with signals and blocking system calls, along with the other complexities of the UNIX process model.
I’m not sure that libuv does anything particularly special with threading. It’s mostly a cross-platform implementation of an event loop that attempts to paper over the differences between event ports, kqueue, epoll, \Device\Afd, etc. It’s otherwise generally just callback-based polling I/O and timers and so on. If you do long-running work in a libuv event handler, I’m pretty sure you’ll hold up other tasks just as much as you would in Javascript; it provides a thread pool abstraction that you’re supposed to push heavy work into, but that doesn’t have access to the event loop. I don’t think libuv does anything structural to mitigate those issues.
The Go runtime has definitely been through several different shifts in its lifetime, and I believe they are homing in on the kind of M:N threading solution that Solaris ultimately arrived at. In particular, I gather they now do some amount of work stealing, and they preempt long-running goroutines by sending thread-directed signals to the OS threads running them. Though they have presumably tried to reduce the amount of state preservation required to switch between goroutines, to be able to asynchronously punt them out mid-compute I have to imagine they’re still preserving quite a lot! They also have their own complexities, like garbage collection, that POSIX didn’t have to grapple with. It’s still challenging to do as good a job as the kernel is able to do at certain aspects of scheduling and preemption in a 1:1 model, but I agree that abandoning much of the existing ABI and process model has allowed them to try some new ideas!
Vexing that I can’t find the old Solaris 8/9 threading retrospective. I did find a note about problems with “LWP starvation”, where there were runnable userland threads but all the LWPs were blocked in the kernel. Which agrees with my memory of the retrospective paper, that there were a lot of blocking syscalls that the threading library was unable to make nonblocking, eg the kernel APIs were still too POSIXy to help, and userland didn’t have a sufficiently adaptive thread pool to offload blocking calls.
Basically, things like open() might vary in speed a lot, such that much of the time you don’t want to offload it, but sometimes you really need to. You don’t know which it is in advance, so either you pessimize the fast case or risk starvation in the slow case. There’s no facility for open() to return quickly with a handle on a slow pending op. When Solaris m:n threading was introduced with talk about userland/kernel co-development, I thought they had tackled issues like this, but apparently that didn’t happen.
https://a.co/d/fF01pMT
I think it’s still in there, even though ostensibly that’s Solaris 10. I don’t have it anymore so I can’t check.
Not a kernel engineer either, but how does it stop the kernel from scheduling? Why would I have to do scheduling in user land?
It doesn’t stop the kernel, it stops your user land code.
If you have 1000 greenlets running in one OS thread, and any single one of those greenlets makes a syscall, your whole app stops while that one greenlet waits for the syscall.
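A tiny illustration of that failure mode, using POSIX ucontext as a stand-in for a real green-thread runtime (makecontext/swapcontext are old and clunky, but they make the point): two “green threads” share one OS thread, and once the first enters a blocking read(), the second can’t run until that syscall returns.

    /* Two cooperatively scheduled "green threads" on one OS thread.
     * While reader() sits in a blocking read(), ticker() cannot run. */
    #include <stdio.h>
    #include <ucontext.h>
    #include <unistd.h>

    static ucontext_t main_ctx, reader_ctx, ticker_ctx;

    static void reader(void) {
        char buf[64];
        /* Blocking syscall: the whole userspace scheduler is stuck here
         * until you type a line (or hit Ctrl-D). */
        ssize_t n = read(STDIN_FILENO, buf, sizeof buf);
        printf("reader got %zd bytes\n", n);
    }

    static void ticker(void) {
        puts("ticker finally ran");
    }

    int main(void) {
        static char stack1[64 * 1024], stack2[64 * 1024];

        getcontext(&reader_ctx);
        reader_ctx.uc_stack.ss_sp   = stack1;
        reader_ctx.uc_stack.ss_size = sizeof stack1;
        reader_ctx.uc_link          = &main_ctx;   /* return here when done */
        makecontext(&reader_ctx, reader, 0);

        getcontext(&ticker_ctx);
        ticker_ctx.uc_stack.ss_sp   = stack2;
        ticker_ctx.uc_stack.ss_size = sizeof stack2;
        ticker_ctx.uc_link          = &main_ctx;
        makecontext(&ticker_ctx, ticker, 0);

        /* "Schedule" reader first: ticker is ready, but it only gets the
         * CPU after the blocking read() returns. */
        swapcontext(&main_ctx, &reader_ctx);
        swapcontext(&main_ctx, &ticker_ctx);
        return 0;
    }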
You’re not stuck though. You submit a program to the kernel and the kernel yields back to you until the program has completed. Even if you were stuck, why does that matter? The premise is that you can have a million threads already.
I think the confusion here is the “if you are creating your own threads” part; I’m not suggesting green threads.
I’m confused. The premise here, from the article, is that kernel-based threads are expensive, not that they’re free and you can have millions. If you have millions, why does it matter if they block, as you say? Having system calls be async would just complicate the userspace programming model for no benefit.
I feel like perhaps we’re talking past each other. I am saying that the premise is “we should try to live in a world where threads are free/ such a world is a better world”. You then said “here is why they are expensive” and I am saying “those are not fundamental features of OS threads, just of our existing APIs”. I’m not advocating for green threads - like the author, I’m suggesting that we could make OS threads faster.
Pushing little programs into kernel mode is a cool technique that I think we’re just learning how to use. Reminiscent of the NeWS window system, and originating from basically the same latency argument (now kernel mode switch vs. network). Presumably over the next decade or so it’ll settle down (or fall out of favor).
As snej said, the stack allocations are a lesser problem with 64-bit (actually 48-bit in current CPUs) address space, but still there’s page table overhead (space and time). If you have a million threads, you’ll notice, whereas a million pending I/O completions are not a problem.
I don’t think threads are particularly expensive.
System calls can be surprisingly fast. From what I remember, it’s something like 50 nanoseconds to enter and exit the kernel. Maybe a bit more in today’s post-Spectre world, but still on the order of a few tens of atomic operations.
Dedicated stack space might have been a problem in a 32 bit world, but today’s 64 bit machines have a lot of address space to play with.
20 years ago, threads were very expensive, especially on Linux. And then NPTL came along and replaced the poor implementation. But, I guess, the scars live on.
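If you want a rough number for your own machine, a crude loop like this (just a sketch; results vary a lot with hardware and mitigation settings) is enough to see that a raw syscall round trip is on the order of tens to a few hundred nanoseconds:

    /* Crude measurement of raw syscall round-trip cost via SYS_getpid. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    int main(void) {
        const long iters = 1000000;
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < iters; i++)
            syscall(SYS_getpid);        /* bypass any libc caching */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%.1f ns per getpid() syscall\n", ns / iters);
        return 0;
    }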
IMO one of the interesting questions to ask is not the difference between virtual/green and “OS” threads, but between threads and processes. At least on Linux, the syscall for spawning a thread (clone3) is essentially a fancier fork. It just has options for sharing the same virtual memory, thread group, pid, parent pid, and so on (among others, like namespaces; I don’t know if it’s possible to use clone3 to spawn a thread in the same address space but a different pid/network namespace, for example).
And with P and E cores now on millions of consumer/developer devices thanks to Apple’s chips, plus some of the fancier threading knobs like audio workgroups on macOS, the obvious gains of virtual threads with M:N threading over deferring to the kernel are getting hazier.
I don’t know a lot of the answers to these questions, but in my experience creating threads is a lot cheaper than people think, while green threads are not as much of a win as people think, and part of the trouble with contemporary async runtimes is that benchmarking them requires changing the architecture, which makes measuring differences hard.
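To make the “clone is a fancier fork” point above concrete, here’s a minimal sketch using the older clone() wrapper (clone3 has no glibc wrapper, so you’d go through syscall(2) for it): the child below shares the parent’s address space and fd table, which is most of what makes it a “thread” rather than a process.

    /* A "thread" is just a clone() with more sharing flags than fork():
     * here the child shares the address space and fd table with the parent
     * but is otherwise its own schedulable task. (Real code should use
     * pthreads; this only illustrates the kernel-level primitive.) */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static int shared_counter = 0;

    static int child_fn(void *arg) {
        shared_counter++;               /* visible to the parent: CLONE_VM */
        printf("child %d ran, arg = %s\n", getpid(), (char *)arg);
        return 0;
    }

    int main(void) {
        const size_t stack_size = 1024 * 1024;
        char *stack = malloc(stack_size);   /* the child needs its own stack */

        /* Share memory, filesystem info and the fd table; deliver SIGCHLD
         * on exit so waitpid() works as usual. */
        int flags = CLONE_VM | CLONE_FS | CLONE_FILES | SIGCHLD;

        /* The stack grows down on most architectures, so pass the top. */
        pid_t pid = clone(child_fn, stack + stack_size, flags, "hello");
        if (pid == -1) { perror("clone"); return 1; }

        waitpid(pid, NULL, 0);
        printf("parent sees shared_counter = %d\n", shared_counter);
        free(stack);
        return 0;
    }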
Is this true? I’m pretty sure io_uring can do async file (and basically anything else) IO, or at least that that is one of its goals.
All IO is asynchronous. Synchronous IO is an illusion created to simplify the life of the programmer.
You mean you don’t halt your CPU when you issue commands to a peripheral?
Technically, I can still remember one situation in the modern era in which we halted the CPU to issue commands to a peripheral: the 1.44 MB floppy drive. Even in Windows NT, the machine would completely lock up every time it hit the floppy disk, IIRC. So technically, I’m kind of wrong? 🤣
Oh, and (IIRC) the reason for that is that the 1.44 MB floppy drive didn’t have its own controller (i.e. processor), so the CPU had to control it directly, and timing was important. So: sorry about the database not responding for a few seconds, and sorry if the clustering software isn’t responding for a few seconds, but I really need to read something off a floppy disk 🤣
Back then, with fragile buses and single-core, single-CPU machines, we had to CLI (clear interrupts) before doing certain kinds of IO, and so the mouse pointer would freeze (it was handled by an IRQ, i.e. mouse movement would suspend everything else on the CPU and run an interrupt handler which would “unpaint” the old cursor location and paint the new one), etc. The CLI was necessary because we didn’t have any other dedicated processors handling timing, etc. for most IO operations. I don’t work at that level at all anymore, but the number of dedicated IO processors in a computer (not counting the CPU and GPU) must be quite high now, and the PCIe bus allows devices to write directly into buffers (sometimes measured in MBs) without involving what programmers think of as “the CPU” (i.e. the IO to/from RAM buffers is routed completely in hardware, via the “north bridge”, bypassing the OS and “the CPU” altogether).
And it’s those buffers that allow for such good async IO behavior. Input just means “hey, there’s something in the buffer for me to copy out”, and output just means “hey, there’s space in the buffer that I can write to”. All the complex stuff is about how efficiently that can be determined, and what the OS does to support efficient notifications thereof. (But I’m way out of date on this … the last project I worked on at that level was about a dozen years ago, trying to do zero copy optimizations, RDMA, and remote “agent” invocation with 40Gbps IB on dedicated networking cards that had their own Intel embedded Xeons and RAM on the cards themselves for doing the work. Talk about complex.)
That’s an insightful take.
Just like we use green threads but then also have synchronous APIs where we bind together all the asynchronous concerns into a synchronous API. It’s flipped from async to sync all the way up the stack.
I think this is what you’re saying, which I agree with: The green thread should be the only one doing blocking IO APIs, because blocking APIs are simpler to reason about and use, and thus the use of blocking APIs produces more reliable code. But if at all possible, below the green threads we’d want async IO APIs the whole way down.
The author of the article was arguing on Reddit that io_uring “doesn’t count” because, according to them, every file IO operation is implemented as a blocking call executed in io_uring’s worker pool.
Putting aside the moving goalposts and bad-faith arguing, I was really surprised by this; I had assumed it was “async all the way down”.
I can’t find anything entirely authoritative on this, but maybe there are complexities making “async all the way to the driver” hard because of the page cache? Perhaps the gains are already mostly had by removing the syscall cost?
The traditional way of writing things like Unix filesystems is to write straight-line code and use what is effectively synchronous block IO to read things off disk, like directories or file indirect blocks. It’s effectively impossible to make filesystem code like this asynchronous; it would require major rewrites of core parts of filesystems. So you make it ‘asynchronous’ by putting it into its own thread. I think the only way you get an operating system that can do truly asynchronous filesystem IO is to have that as a design goal from the start and refuse to allow anyone to add kernel code with synchronous IO waits, even though this will make filesystems harder to write.
(It’s one of the ironies of life that low level Unix block IO operations are normally asynchronous, but all of the filesystem code used to read things does basically ‘submit IO request; go to sleep until IO request completes’, generally wrapped up in a function or a macro. This isn’t just Linux; it’s a pattern all over multiple Unix kernels, including historical ones.)
Linux isn’t internally asynchronous. Though, Windows I’m pretty sure can do async IO internally and has both IOCP and IORING, so.
I believe you’re right. Now why isn’t it available on Darwin, I grumble?
There’s an alternate universe where kqueue won the async API “wars” and Linux ported kqueue instead of inventing epoll, and then inventing io_uring because epoll wasn’t good enough.
In this fantasy universe, Darwin, the BSDs, and Linux all support the same async API. POSIX standardizes kqueue. Windows provides a compatibility API. World peace is achieved and all is well.
Alas.
There are a couple of things that io_uring does that are better than kqueue + AIO (also worth noting that, although Linux implements the POSIX AIO APIs, the implementation is very slow and so people avoid it, whereas on XNU it’s good and on FreeBSD it’s the thing Netflix uses to saturate 200 Gb/s links with TLS):
The first is that io_uring learns lessons from paravirtual devices and provides a mechanism that lets the kernel pull in a batch of things when it has an idle thread, whereas kqueue and lio_listio require a batch to be pushed in at the same time. I’m somewhat uncertain how much this matters in practice. My intuition is that there are some outliers where it makes sense but a lot where it simply doesn’t matter. The amount of work you typically do in the kernel for an I/O op means that the syscall overhead is negligible. The main win is probably a reduction in cognitive load for the programmer. You can just start throwing things at the kernel and let something else figure out whether it needs to be kicked.
The second, and I think this is a major one, is providing isolated namespaces for file descriptors. POSIX has a requirement that new file descriptors must be the lowest unused FD number. This requirement exists because dup2 didn’t exist in early UNIX and this was how the shell did file descriptor redirection. As a side effect, it means that every close must synchronise with any operation that creates a new file descriptor, which causes scalability issues for a lot of workloads that do things like accept-read-write-close in a tight loop (web servers and other things with request-response semantics) on multicore systems. There’s a nice paper from around 2000 showing that having per-thread fd numbers can speed things up.
I can imagine adding a facility somewhat like this without io_uring where you take the highest bit (below the sign bit, fd numbers have to be signed) as a discriminator and split the FDs into pool + ID pairs, where processes can choose how many bits are used per pool and can explicitly request with variants of socket, openat, accept, and so on that the new FD is in a specific pool (and not necessarily the lowest one). That is quite a big and invasive change. In io_uring there’s a simple change to decide whether new FDs are added to the FD table or to something local to the ring.
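The lowest-unused-number rule is easy to see in a few lines, and it’s exactly this global “pick the smallest free slot” behaviour that forces close() and every FD-creating call to synchronise:

    /* POSIX requires the lowest unused descriptor number, so after closing
     * stdin the very next open() hands back fd 0. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        close(0);                                  /* free descriptor 0 */
        int fd = open("/dev/null", O_RDONLY);
        printf("new fd = %d\n", fd);               /* prints "new fd = 0" */
        return 0;
    }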
I think io_uring (fixing the Linux mistake of writing uint64_t when they mean void*) would be a good addition to other operating systems.
There’s an alternate universe where the *NIXes copied IOCP from Windows and thus had good completion based IO. Alas, they went with readiness notification instead for some reason.
Same reason that they didn’t use to have poll(2), only select(2). It’s 80s BSD / Mach with updates as needed to make the shiny products shine better. To put it another way, Darwin is developed by a company that makes money selling new hardware. io_uring is made by a company (Meta) that makes money not buying new hardware. They both have incentives to improve OS performance, but they’re subtly different.
I would say that there will always be at least one operation that cannot be performed asynchronously. It might be a very specific one which is only synchronous in very specific situations with very specific settings, but it is basically guaranteed there is at least one such case.
I’m not sure I have the same intuition; it seems that making everything async-able has been a goal of the kernel for a while, and they’re close, if not there already.
It’s been a couple years since I was that far down the stack, has that initiative stopped short of completion? Is it expected to?
I can’t tell you specifically for the Linux kernel. My point is that there will always be something that can block, because of how many syscalls and syscall options are out there.
I think this post, https://www.remlab.net/op/nonblock.shtml, makes a statement regarding this. Are you contradicting it here?
That’s from 2014 and talking about POSIX file IO APIs.
io_uring was added in like 2019, and isn’t a POSIX API.
I think that article is just outdated.
[Comment removed by author]
The utility of async programming is to let people express complex state machines that are typical in I/O bound code. The performance benefits are, quite frankly, incidental.
Would it? I think the complexity would be pushed elsewhere. With asynchronous I/O, you get to make local concurrency reflect local constraints on resources, and receive events for I/O completions etc; with threads, you have to make local concurrency reflect I/O, and deal with locking etc for handling constraints on concurrent access to local resources.
Fun fact, the first version of Linux threading was implemented by Xavier Leroy, the creator of OCaml. This dude is legendary: https://en.wikipedia.org/wiki/LinuxThreads
I’m reproducing for posterity here a Fediverse post I once made, now only visible on a handful of instances that still have it in cache. Now or never.
I think even if there were no other advantages (such as potential greater performance or lower overhead) from asynchronous I/O, explicit forms of it would still be preferable to traditional threading simply because it’s easier to reason about.
This classic post goes into some more detail, but the gist is: traditional threading is simply too difficult to reason about, largely because it allows execution to be suspended/resumed effectively anywhere, at any time, which creates a combinatorial explosion of “what if that happens between these two lines of code” cases you have to account for. Explicit async, on the other hand, explicitly marks the places where code might suspend and resume, which vastly reduces the number of “what if that happens between these lines” cases you need to worry about and take care of. For example, in an async/await approach, await expressions tell you where a piece of code might suspend/resume execution.
That’s not “traditional threading”; the APIs are the same for green threads, with added tricky details to keep in mind. That’s perhaps “threading” in the way it is exposed in major operating systems’ APIs.
What you mention as a problem is not a problem of threads, but rather of accessing and mutating shared resources as if they were not in a multi-threaded space. You should not do that. Because if you do, of course that problem arises. Protect your shared resources with a lock/semaphore. If a language or runtime forces you to do this, then the problem you state cannot possibly occur. Erlang and Elixir do something along these lines: they force you to explicitly pass “shared” data around as state.
But deadlocks can. Have you ever read The Problem With Threads? Classic paper.
Erlang gets around this by not sharing data between processes, which requires a whole honking lot of memcpy, which is expensive. Pony has a smarter solution, but Pony never really caught on.
I think if you read the longer article I linked, you will find these points addressed.
The article makes the assumption that classic threading means “let’s mutate everything without any synchronization”. It states that in the introduction. That is a horrible assumption. Of course you shouldn’t do that. Protect your shared resources with locks! That is the intended way to use it.
I reckon one can argue, “why is it possible to do so, then?” I guess that’s a valid point. Older languages were not designed with idiot-proofing in mind. Arguably, if something is strongly discouraged, there might be little to no reason to even support it.
This occurred to me the other day: I started working on a Node project at work and had to spawn several async tasks to speed up some IO-heavy code, and the fact that JS only ever runs on a single thread makes async much easier to reason about than, say, Rust, where multiple coroutines might be running in parallel over several threads. There are no synchronization primitives in Node’s standard library because you just don’t need them!
what if you’re doing explicit async across multiple threads? :p
Then you deserve whatever evil you end up summoning.
Then hopefully you have a system that doesn’t share mutable data between threads. Something like Rust’s “Send” trait, or Erlang processes. This lends itself to the Actor model, where individual objects are single-threaded but run on separate threads.
That does not stop sharing mutable data between threads. It just prevents data races. You can share data with mutexes. Though I think actor-model-esque stuff is popular in Rust async; sharing by having an actor own something and sending messages to it is a pretty good model.
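A bare-bones sketch of that ownership pattern, with one pthread as the “actor” and a pipe as its mailbox (a real actor library would add typed messages, replies, and backpressure): only the owning thread ever touches the counter, so no mutex is needed.

    /* Actor-ish ownership: the worker thread exclusively owns `count`;
     * other threads "share" it only by writing commands into the pipe.
     * Build with -pthread. */
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static int mailbox[2];             /* pipe: [0] = read end, [1] = write end */

    static void *actor(void *arg) {
        (void)arg;
        long count = 0;                /* state owned by this thread alone */
        char cmd;
        while (read(mailbox[0], &cmd, 1) == 1) {
            if (cmd == '+') count++;
            else if (cmd == 'q') break;
        }
        printf("final count = %ld\n", count);
        return NULL;
    }

    int main(void) {
        pipe(mailbox);
        pthread_t t;
        pthread_create(&t, NULL, actor, NULL);

        for (int i = 0; i < 1000; i++)
            write(mailbox[1], "+", 1); /* send "increment" messages */
        write(mailbox[1], "q", 1);     /* send "quit" */

        pthread_join(t, NULL);
        return 0;
    }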
I think this article is flawed because it presumes the amount of effort that has been put into async would have yielded appreciably better-performing threads, but it does not cite any evidence for that. Quite the contrary: people started moving to async precisely because tons of research into how to make threads cheaper went nowhere.
The author claims that back when C10K was becoming popular threading was immature (and arguably on Linux it was), but there had already been many efforts to make threading cheaper in the industry, most of which had been abandoned because they broke down or were nightmares to maintain. And it is not like people have stopped trying; there continue to be many ideas and experiments. Off the top of my head:
All of those were attempts to make threads cheaper, either automatically or through explicit programmer control. They all (mostly) failed due to complexity or compatibility problems with existing code, and I don’t see any evidence we were getting any closer to doing better.
I also think the article does itself a disservice by evaluating threading performance on Linux. macOS and Windows have taken different approaches that de-emphasize the use of threads (things like GCD and Swift structured concurrency). The Linux kernel folks don’t really have as much ability to control what people do in user space as platform vendors do, so instead they did what they always do: focus on optimizing the kernel primitives as much as possible. Not surprisingly, they have some of the best threading micro-benchmarks out there, but that actually works against the author, since it means he is comparing a best-case scenario for modern threading, determining it is too slow anyway, and then lamenting that if people had spent more effort getting it even faster (without any evidence as to why that would work) we might not need async I/O.
The situation with Linux makes me doubtful that simply optimizing a modern 1:1 kernel-scheduled threading implementation will cover the gap necessary to replace async, and I see plenty of evidence that the existing software ecosystem cannot handle a more fundamental change outside VMs or more isolated environments.
I keep this GitHub issue discussion bookmarked for whenever discussions like this come up.
I think this misses the bigger picture: OS threads are slow because you don’t control the scheduler. Its having to make decisions for tasks outside your program slows it down. Its having to execute adversarial programs slows it down. Concurrent systems generally get a speedup when they take control of their task scheduler.
It’s not anything to do with the scheduler, per se; it’s almost universally the cost of virtualising the increasingly large and complex amount of CPU state so that you can pretend you’re the only one on the CPU. It’s the cost of saving all of your registers, and your floating point registers, and now a bunch of extra AVX crud, and then dealing with the litany of mitigations and other cache-detuning steps required to get you off the CPU. Then we have to do everything in reverse to get the next thread onto the CPU. This is partly why scheduling quanta are ideally measured in at least tens of milliseconds, in order to better amortise the cost of dispatch.
To the extent user mode scheduling offers different latency characteristics and performance, it’s because you’re giving some of this up. You’re giving up the security boundaries between processes, for example. You’re likely adjusting your program’s idea of its internal ABI such that you merely don’t have to save all of the CPU state, and can instead focus on an easier subset. Async/await systems with stackless switch points actually take this the furthest, in that they often need to save almost none of the register state, they just need to record where they got up to in the function and potentially some local variables to come back to on resume.
The cost of this, apart from security boundaries, is generally that you’re hiding everything from the kernel, which is best placed to do time-based preemption of long-running tasks. This is why people ultimately begin to reach for work-stealing M:N systems, to prevent silos of work being blocked up behind long-running tasks while some set of kernel-visible threads sits idle. It’s difficult and complex because the only mechanism you generally have to induce a time-based preemption is thread-directed signals, with all of the issues those entail!
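As a caricature of how little a stackless switch point has to save, here’s a hand-rolled state machine in the protothreads style (just a sketch, not what any particular compiler emits): no registers, no stack, just a resume index and whatever locals must survive the suspension.

    /* What a stackless coroutine "context switch" amounts to: store a resume
     * point and the live locals in a small struct, then return. Nothing like
     * a full register save is needed. */
    #include <stdio.h>

    struct ticker {
        int state;      /* where to resume */
        int i;          /* the only local that must survive a suspension */
    };

    /* Returns 1 while there is more work, 0 when finished. */
    static int ticker_step(struct ticker *t) {
        switch (t->state) {
        case 0:
            for (t->i = 0; t->i < 3; t->i++) {
                printf("tick %d\n", t->i);
                t->state = 1;   /* record where to resume...            */
                return 1;       /* ...and "suspend" by simply returning */
        case 1:;                /* resumed: fall back into the loop     */
            }
            t->state = 2;
            /* fall through */
        case 2:
            return 0;
        }
        return 0;
    }

    int main(void) {
        struct ticker t = { 0, 0 };
        while (ticker_step(&t))
            puts("  (scheduler runs something else here)");
        return 0;
    }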
It feels like there’s a bit more to this story. This benchmark:
https://github.com/jimblandy/context-switch?tab=readme-ov-file#measuring-thread-context-switch-time
Shows that, if you pin the threads to one core, the context switch cost is the same.
My best explanation is that this benchmark models a situation where the reason for the context switch is IO, so every switch is amortized by a syscall, so making the switch itself cheaper doesn’t really improve performance.
In other words, the switch improvement only matters for direct IO-less switches, where, e.g. you send something between tasks using in-memory channel, rather than terminating each task (virtual) quanta with a syscall.
I don’t understand how frequently you need such “direct” switches in real applications. On the one hand, I can see a picture where each task lives between a read and a write anyway, so it doesn’t really matter. On the other hand, perhaps under high load you can wake up a bunch of tasks at the same time (epoll_wait returns a bunch of things at once).
In general, I have yet to see conclusive benchmarks that carefully demonstrate all the relevant effects here. Everything I’ve seen so far goes through several steps of argument/counterargument, and they are kinda not hitting the bottom :(
Well, there always is. Modern CPUs are extremely complicated; by virtue of out of order dispatch, register renaming, deep speculative pipelines, complex and increasingly vast cache hierarchies, it is frequently difficult to pin down what the machine is even doing a lot of the time. Stack-based profiling requires that the CPU unwind enough speculation and commit to certain issued instructions so that the timer interrupt handler that fires to measure the stack can observe only things that are definitely true, and not things that might not come to pass. What impact does that have on execution? Who can say!
The only way to know is to take your actual workload and try it out. Synthetic benchmarks are a curiosity and a tool, to be sure, but the proof of the pudding is in the eating, as they say.
This is not really surprising to me. How many threads does it create? If they’re contending for a single core, even though they are within the same process, there’s a lot of activity the kernel engages in to spill the user CPU state into privileged stack storage and then restore the state of the target user context. I would venture that most modern OS kernels generally try to avoid internally using, for example, the x86 floating point and extended register state; by being very careful, the kernel can sometimes avoid the need to save and restore the FPU and other wider registers for a user context when merely servicing a system call. If you need to punt between threads, though, all bets are off and you must do the work.
If an async process has around number-of-CPUs threads doing a mixture of polling and actual handler work on an otherwise idle system, it would be conceivable to me that they end up mostly not migrating between CPU cores during execution. They may do a lot of system calls – indeed, more, probably, than the straight line synchronous blocking code, because you also have to poll – but those are just user-kernel-and-back transitions, and are, I believe, cheaper than user-kernel-different-user transitions.
To be clear, I wasn’t actually advocating a global position on threads or async in my post anyway, just trying to talk about some of the trade-offs. I think Rust does a good job with async/await of allowing a user to write ostensibly straight line code (with some caveats) which gets turned into the kind of state machine I would have had to write by hand to use event ports or kqueue or epoll in C. I think Rust also makes it remarkably easy to use actual OS threads when the moment calls for it, what with the Send and Sync markers and a good set of base synchronisation primitives.
I do think that naive libc-level full stack M:N threading, where the only thing that really moves to user mode is the work of dispatching between user threads, is not worth the trouble. The goal of user mode task systems (like what Tokio in Rust provides, or Go, or Erlang, or Node and Javascript) should be at least somewhat substantially focused on ergonomics as much as performance or memory usage or anything else. If you can find a good balance of ergonomics and resource consumption – and, mostly unmentioned here, you can figure out how to debug that software! – then you’re probably onto a good thing.
That benchmark does punt between the threads, and it is not slower than async.
There are 500 threads, they are connected together through 500 pipes and they are forwarding a single byte across this pipe ring.
In the async version, we have these same 500 pipes, but they are managed by 500 async tasks, not 500 pthreads
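For reference, the shape of the threaded version in plain pthreads looks roughly like this (a scaled-down sketch, not the repo’s actual code): N threads in a ring, each blocking on a read() from its own pipe and forwarding the byte to the next one.

    /* Scaled-down sketch of the pipe-ring benchmark: a token circulates
     * through N threads connected by N pipes. Build with -pthread. */
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    #define N 8          /* the real benchmark uses 500 threads/pipes */
    #define LAPS 10000

    static int pipes[N][2];

    static void *worker(void *arg) {
        long id = (long)arg;
        int in  = pipes[id][0];              /* read from our own pipe...   */
        int out = pipes[(id + 1) % N][1];    /* ...forward to the next one  */
        char byte;
        for (int lap = 0; lap < LAPS; lap++) {
            if (read(in, &byte, 1) != 1) break;
            write(out, &byte, 1);
        }
        return NULL;
    }

    int main(void) {
        pthread_t tids[N];
        for (long i = 0; i < N; i++) pipe(pipes[i]);
        for (long i = 0; i < N; i++) pthread_create(&tids[i], NULL, worker, (void *)i);

        char byte = 'x';
        write(pipes[0][1], &byte, 1);        /* inject the token into the ring */

        for (long i = 0; i < N; i++) pthread_join(tids[i], NULL);
        return 0;
    }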
Yes, but how many OS threads is the executor using to underpin those async tasks? I have observed, for example, that the multi-threaded Tokio runtime creates at least one OS thread for each apparent schedulable entity (which might be one of two or more vertical threads on a hyperthreading CPU) and async tasks can then bounce between them.
If your runtime creates 48 threads, and you then pin them all to the one CPU, you’re presumably going to be punting between OS threads at least every 20 msec if not more frequently based on your pattern of system calls and blocking and so on?
For async, in this benchmark there’s no difference between using current_thread and work-stealing tokio executors.
Then I expect the next step will be to figure out what the benchmark is actually measuring! As a contrived, synthetic benchmark, it will be important to examine both the program itself and the underlying OS at a fine-grained level to see what’s actually going on. Merely looking at the aggregate rates obviously isn’t enough to do anything but speculate. I would get started by using DTrace, but I imagine there’s something similar you can do with whatever OS you’re running the benchmark on.
I guess I’ll get to that one day!
But for now, I remain extremely skeptical of the claims of the form “user-land threads are faster because…” or “kernel-level threads are faster because…”
There are a lot of such claims, and they all have rather elaborate explanations of what is going on, but they just never connect to explainable benchmarks.
Saving all the relevant general-purpose registers (and the fp/control/segment/mask registers, which userspace stackful coroutines wouldn’t have to save) takes a few hundred nanoseconds at most.
The x86 SMP memory model is that all cores can address memory. Internal details like cache coherency, the interconnect, etc. make this work somewhat efficiently. The kernel optimizing task scheduling for cache access is a decision it makes from a global view, and it isn’t strictly necessary:
One can write a userspace multi-threaded work-stealing scheduler where each thread is pinned to its own core and observe not much performance loss compared to not pinning and letting the kernel occasionally keep worker threads on the same core.
I describe this as controlling latency by controlling the scheduler. “Time limits” are the most opaque method to achieve it (but require preemption as you’ve noted). Other systems achieve this via prioritization and other bounding rules - which is often more efficient given the scheduler knows more about the task’s needs (unlike a kernel <-> process/thread relationship).
Something something, HaikuOS?
I’ve come to like async being the only available method for IO, like how Roc does it. The biggest problems usually come from all kinds of conflict between async and sync IO, and if only one of them is available, everything is much nicer.
Reading this, I’ve been wondering if the author is aware of lightweight/green threads. Moreover, there are implementations which automatically run operations that would block on OS threads (they’re not well documented, but the list doesn’t change often at all).
I’m not that smart, but I have a really hard time reasoning about data safety and deadlocks in multi-threaded systems. Smarter people than me also seem to have a hard time with this.
One of my hot takes is that concurrency safety is a more important & revolutionary property of Rust than memory safety.
Another hot take is that we probably want to architect our systems to have one single-threaded process per core, each using async-io, with some message passing system between them, ideally with capability passing.
Ah capability passing. I wish pony were more popular.
I was expecting the title mistake to be about async/await and associated function coloring (not something I think is obviously a mistake, but I’m open to other opinions–say, maybe async/await has been over-applied)–but instead the article appears to be based on a false dichotomy of optimization effort going into either native thread overhead or async infrastructure in programming languages:
Perhaps we can do better than spawning lots of processes and context switching between them, regardless if in the language runtime or the operating system?
People like Jim Gray advocated using pipeline parallelism instead (one process/thread per core) and partitioning/sharding slow stages in the pipeline (scaling a stage by assigning more cores to it). This is also what the LMAX Disruptor does, and what databases do (for a more recent example, see Umbra’s morsels).
Given the success of pipelining in manufacturing and hardware, I wonder why nobody seems to be following Jim’s advice and trying to make it easier to do in software (perhaps via new programming languages)?
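For what it’s worth, the basic shape is easy to sketch with plain pthreads: one thread per stage, each pinned to its own core, with pipes acting as the queues between stages (pthread_setaffinity_np is a GNU extension; sharding the slow stage would just mean running two copies of it reading from the same pipe, which works because small pipe reads are atomic).

    /* Pipeline parallelism: one thread per stage, one core per thread,
     * stages connected by pipes acting as queues. Assumes at least 3 cores.
     * Build with -pthread. */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>

    static int q1[2], q2[2];     /* stage1 -> stage2, stage2 -> stage3 */

    static void pin_to_core(int core) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        pthread_setaffinity_np(pthread_self(), sizeof set, &set);
    }

    static void *produce(void *arg) {            /* stage 1: generate work */
        (void)arg; pin_to_core(0);
        for (int i = 1; i <= 100; i++) write(q1[1], &i, sizeof i);
        close(q1[1]);
        return NULL;
    }

    static void *transform(void *arg) {          /* stage 2: the "slow" stage */
        (void)arg; pin_to_core(1);
        int x;
        while (read(q1[0], &x, sizeof x) == sizeof x) {
            x *= x;                              /* pretend this is expensive */
            write(q2[1], &x, sizeof x);
        }
        close(q2[1]);
        return NULL;
    }

    static void *consume(void *arg) {            /* stage 3: sink */
        (void)arg; pin_to_core(2);
        int x, sum = 0;
        while (read(q2[0], &x, sizeof x) == sizeof x) sum += x;
        printf("sum of squares = %d\n", sum);
        return NULL;
    }

    int main(void) {
        pipe(q1); pipe(q2);
        pthread_t t[3];
        pthread_create(&t[0], NULL, produce, NULL);
        pthread_create(&t[1], NULL, transform, NULL);
        pthread_create(&t[2], NULL, consume, NULL);
        for (int i = 0; i < 3; i++) pthread_join(t[i], NULL);
        return 0;
    }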
Somewhat similar to Spark (the big data framework).
I suppose, but perhaps we need something without the clunkiness of being a library and the overhead which causes it to be slower than using a single thread when run locally?