1. 39
  1.  

    1. 30

      Every time I read yet another negative post about async Rust, I feel bad for withoutboats (deliberately not doing an @-mention so as not to alert them if they don’t choose to follow this thread). withoutboats has expressed frustration with the discourse around async Rust, the way they were kicked off the project, and the slow progress since then. I wish I could donate money to some kind of pool to hire withoutboats to properly finish async Rust. I don’t have the means to individually hire someone for a reasonable wage and length of time, but maybe together we can make it happen.

      Meanwhile, I think it would be better to just quietly ignore async Rust if you feel it doesn’t benefit you.

      1. 9

        So it is very frustrating to see the discourse focused on inaccurate statements about async Rust which I believe is the best system for async IO in any language and which just needs to be finished.

        What needs to be done to finish it?


        I also feel the topic of async Rust has turned into a pile-on/circlejerk that’s entirely unproductive. That seems to be the only way engineers can deal with complex, drawn-out changes.

        1. 10

          What needs to be done to finish it?

          withoutboats wrote about this.

          1. 4

            I’ll dig into it, but scanning it…

            • It starts with a change to a feature (async generators) that has itself not even landed yet?
            • It’s a ton of language minutiae which does not in any way make it clear to me what my developer experience would be in that future.
            1. 2

              Yeah, read it.

          2. 2

            Dug around to find this page which seems to indicate where the lang team’s head is at with generator syntax.

            I am always a bit sad about not having generators in Rust, given that so much of the language is about being good with holding onto things just long enough to be needed and once you have it, iterators work quite well for so much. Just really tedious to write up simple ones.
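
            To make the tedium concrete, here is roughly what hand-writing even a trivial iterator looks like today (a sketch; the generator comparison at the end is hypothetical, not stable syntax):

              // Hand-written "count up to n" iterator: a struct plus an Iterator impl.
              struct UpTo {
                  next: u32,
                  end: u32,
              }

              impl Iterator for UpTo {
                  type Item = u32;

                  fn next(&mut self) -> Option<u32> {
                      if self.next < self.end {
                          let n = self.next;
                          self.next += 1;
                          Some(n)
                      } else {
                          None
                      }
                  }
              }

              // With a generator feature, this could plausibly shrink to a single block that
              // just loops and yields, which is what the linked syntax discussion is about.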

      2. 5

        That’s the first I’m hearing that he was kicked off the project. I wonder who those friends were whose word was enough to terminate his involvement? Strange remarks all around, and throwing his previous team under the bus like that.

        1. [Comment removed by author]

      3. 4

        Can you, though? It seems like an increasing number of crates have async APIs, often with tokio as a dependency. If the trend continues, writing a “no-async” program may start to feel like writing a no-std program.

      4. 3

        How do I ignore it if I need to use things like gRPC?

        1. 7

          Lazy option: use the gRPC library’s synchronous API in a thread pool.

          Performant option: use gRPC’s completion queue API and (optionally) a bit of unsafe, let the gRPC library handle the async state machine.
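
          One way the lazy option can look in Rust when the gRPC crate only exposes async (a sketch; get_user and its signature are made up, not a real client API): keep one shared Tokio runtime and give the rest of the synchronous code a blocking facade.

            use std::sync::OnceLock;
            use tokio::runtime::Runtime;

            // Single runtime shared by all blocking wrappers.
            fn rt() -> &'static Runtime {
                static RT: OnceLock<Runtime> = OnceLock::new();
                RT.get_or_init(|| Runtime::new().expect("failed to build Tokio runtime"))
            }

            // Hypothetical async gRPC call; stands in for whatever the client crate exposes.
            async fn get_user(id: u64) -> Result<String, Box<dyn std::error::Error>> {
                Ok(format!("user-{id}"))
            }

            // Blocking facade: synchronous callers (e.g. workers in a thread pool) never see async.
            fn get_user_blocking(id: u64) -> Result<String, Box<dyn std::error::Error>> {
                rt().block_on(get_user(id))
            }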

        2. 3

          Build your own sync gRPC, with blackjack, etc.

      5. 3

        Some points:

        Async Rust is quite a feat of engineering and, in a way, awesome. No one should be ashamed of it, or take critique of it personally, even when people point out its shortcomings.

        It is great that Rust supports async, but the problem is that the ecosystem decided to default to it, uncritically assuming “it is strictly better”.

        It would be great if we could “just quietly ignore async Rust”, but there are so many cases now where there is no choice, because the ecosystem provides only an async-powered dependency, or the sync version is just a wrapper around the async one. Sooo many cases. In my own software I don’t “ignore async Rust”, but use it alongside blocking Rust, tactically, in threads where the benefits of async outweigh the downsides. But >90% of the code is better off as blocking Rust. And a typical webserver doesn’t need async, so why are there no well-supported blocking-IO web servers anymore? I don’t think such a nuanced approach is getting through to the community at large, and I wish the official channels would steer and educate users that blocking-IO Rust is OK and should be the default.

        The burnout and slowdown in the development of Rust is in part the effect of async’s complexity and the extra work it creates, IMO. If the ecosystem weren’t async-by-default, the shortcomings of async would not be so important and it would be easier to just “take the time” to solve them.

        I also agree with withoutboats that it would be better if Rust just followed through with making async as “done and complete” as we can and be done with it, especially since we’re already in a “default-async Rust” situation.

        1. 4

          I don’t understand why you can’t just ignore async Rust. Just block on the future in place and move on. This is how I worked with async libraries pre async-await: I just called .wait().
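
          A minimal sketch of that approach with today’s APIs (futures 0.1’s .wait() is gone, but futures::executor::block_on fills the same role). Note that futures which rely on a particular runtime’s reactor (e.g. Tokio’s IO types) still need that runtime running:

            use futures::executor::block_on;

            async fn fetch() -> u32 {
                42 // stand-in for some async library call
            }

            fn main() {
                // Block the current thread until the future completes; no runtime ceremony.
                let value = block_on(fetch());
                println!("{value}");
            }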

          1. 1

            Show me a first-class non-async web framework with good community support. None left. Rouille is the closest thing from the past, but I never liked the API, and it is not popular or widely used. Astra had a good idea, but the developer seemed to not have enough time (Edit: I checked, and it seems there was a new release 2 weeks ago, so maybe not all hope is lost). The problem is that no one wants to run blocking IO anymore. Got to be web-scale, of course, for that 50-requests-per-day blog.

            My little CLI tool needs to support downloading something from an S3 bucket, etc.? Got to add 1 MB to the binary size, because you can’t get the AWS SDK without async.

            You’d like to stream-decompress a tar.gz archive from an S3 bucket to avoid temporary files? Good luck “just adding .wait()”, as now you need to deal with composing async and blocking IO.

            And so on. I practically never need an ultra-high-performance networking service where async IO would make a noticeable difference. Faster compilation, smaller binary sizes, and better language support would usually be way more valuable than async, but it seems like for most people benchmark numbers are all that matter.

            1. 2

              Can you explain why .wait() would not work? Or the spawn_blocking equivalent?

            2. 2

              Show me a first-class non-async web framework with good community support. None left. [..] The problem is that no one wants to run blocking IO anymore. Got to be web-scale, of course, for that 50-requests-per-day blog.

              The “problem” is that no one wants to put a lot of effort into creating a web framework that isn’t async, presumably because the overlap between people who want a web framework, can make a web framework, and don’t want to use async is close to zero. If the community of people who don’t want to use async wants more non-async libraries, they should write them.

              You’d like to stream-decompress a tar.gz archive from an S3 bucket to avoid temporary files? Good luck “just adding .wait()”, as now you need to deal with composing async and blocking IO.

              No? Every async runtime that doesn’t use async file IO handles the blocking IO on a threadpool for you.

          2. 1

            There’s a little bit of a virality situation in the Rust ecosystem because of async. For many libraries that involve IO, the “best” libraries are either async-only or async-first. If you’re writing web services, doubly so. I like async so I don’t mind it, but I’m sympathetic to people who don’t want to use async but have to put up with it because they don’t have another choice.

            1. 5

              But you can just block on any future you run into, right? I guess if you’re using a web framework that is async you’ll run into issues there; is that the issue?

              1. 1

                But that’s not just “ignoring” async, is it? There’s often one complication or another: at minimum you need to pull in a runtime (or at least the futures crate) to block on a future. I haven’t done this myself because I don’t mind async, but if you have to do a multi-step async process, I can imagine that having to write the boilerplate to block on all of it can get tiresome.

    2. 29

      If you are a hyperscaler, you are not using async/await. For a hyperscaler, the cost of managing their server infrastructure may be in the billions. Async/await is an abstraction. You’ll not want to use it. To really get the most out of your performance, you are better off modifying the kernel to your needs.

      I don’t think these people are using async/await, and for good reasons.

      I obviously can’t speak for all the hyperscalers, but a lot of folks at AWS sure are using async/await, increasingly alongside io_uring to great effect. There’s always money in performance, of course, but one of the great things about the Rust ecosystem is how easy it makes it to get to “really very good” performance.

      As a concrete example, when we were building AWS Lambda’s container cache, we built a prototype with hyper and reqwest and tokio talking regular HTTP. We totally expected to need to replace it with something more custom, but found that even the prototype was saturating 50Gb/s NICs and hitting our latency and tail latency targets, so we just left it as-is.

      In reality, your OS scheduler does the exact same thing, and probably better.

      I think the reality is that, for high-performance high-request-rate applications, a custom scheduler can do much better than Linux’s general purpose scheduler without a whole lot of effort (mostly because it just knows more about the workload and its goals).

      You have to be (1) working at a large organization, (2) be working on a custom web server (3) that is highly I/O bound.

      This doesn’t seem right either.

      1. 2

        Do you think you would have been able to saturate the NIC using the traditional OS threads model (no async/await)?

        1. 16

          Yeah, for sure. But the async model made the code clearer and simpler than the equivalent threaded code would have been, when taking everything into account (especially the need to avoid metastability problems under overload, which generally precludes a naive thread-per-request implementation).

          1. 1

            I do appreciate the tail latency angle. There are some (synthetic) benchmarks that show async/await being superior here. (Of course this depends on the async runtime too.) On the other hand, it seems to me too niche a requirement to justify async/await, assuming overload is not a very common situation. I am assuming async/await is a worse experience here, and reading your comment, you did not have that experience.

            1. 8

              Services of all sizes need to protect against overload. My understanding is that async/await enables the server to handle as much load as the CPU(s) can handle without having to tune arbitrary numbers (e.g. thread pool size), and allowing the network stack to apply natural backpressure if the load increases beyond that point.

              Edit to add: If it seems that designing a service to gracefully handle or prevent overload is a niche concern, perhaps that’s because we tend to throw more hardware at our services than they really ought to need. Or maybe you’ve been lucky enough that you haven’t yet made a mistake in your client software that caused an avoidable overload of your service. I’ve done that, working on a SaaS application for a tiny company.

              1. 3

                When using tokio (and this goes for most async runtimes) it is actually not recommended to use async/await for CPU-bound workloads. The docs recommend using spawn_blocking which ends up in a thread pool with a fixed size: https://docs.rs/tokio/latest/tokio/#cpu-bound-tasks-and-blocking-code
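
                For reference, a minimal sketch of that pattern (assuming a Tokio multi-thread runtime; expensive_hash is a made-up stand-in for CPU-bound work):

                  #[tokio::main]
                  async fn main() {
                      // Offload CPU-heavy work so it doesn't stall the async worker threads.
                      let digest = tokio::task::spawn_blocking(|| {
                          expensive_hash(b"some large input")
                      })
                      .await
                      .expect("blocking task panicked");
                      println!("{digest}");
                  }

                  fn expensive_hash(data: &[u8]) -> u64 {
                      data.iter().fold(0u64, |acc, &b| acc.wrapping_mul(31).wrapping_add(b as u64))
                  }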

                1. 9

                  When using tokio (and this goes for most async runtimes) it is actually not recommended to use async/await for CPU-bound workloads.

                  No, that’s about hogging up a bunch of CPU time without yielding. It doesn’t apply if you have tasks that use a lot of CPU in total but yield often, or simply have so many tasks that you saturate your CPU. I’m pretty sure the latter is what @mwcampbell was referring to with “enables the server to handle as much load as the CPU(s) can handle”.

            2. 6

              Tail latency is extremely important and undervalued. This is why GC languages are unpopular in the limit, for example — managing tail latencies under memory pressure is very difficult.

              edit: of all groups of lay people, I think gamers have come to understand this the best. Gamers are quite rightly obsessed with what they call “1% lows” and “0.1% lows”.

      2. 1

        As an AWS user, I can say you can saturate S3 GetObject calls with async/await pretty easily as well, to the point where there are a few GitHub issues about it. https://github.com/awslabs/aws-sdk-rust/issues/1136 <- essentially you have to hold your concurrency to between 50 and 200 depending on where you’re situated wrt S3.
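
        A rough sketch of what capping the concurrency tends to look like (get_object here is a stand-in, not the real aws-sdk-s3 signature): build a stream of requests and bound how many are in flight with buffer_unordered.

          use futures::stream::{self, StreamExt};

          // Stand-in for the SDK call; the real aws-sdk-s3 API differs.
          async fn get_object(key: String) -> Result<usize, std::io::Error> {
              Ok(key.len())
          }

          #[tokio::main]
          async fn main() {
              let keys: Vec<String> = (0..10_000).map(|i| format!("object-{i}")).collect();

              // Keep at most 100 requests in flight, inside the 50-200 range mentioned above.
              let results: Vec<_> = stream::iter(keys)
                  .map(get_object)
                  .buffer_unordered(100)
                  .collect()
                  .await;

              println!("fetched {} objects", results.len());
          }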

    3. 17

      Like many (most?) posts that are labelled as being against async/await in Rust, this one seems to actually be against Tokio:

      In the default tokio configuration, the async runtime will schedule tasks across many threads, to maximize performance. […] Even worse, you now need to choose between std::sync::Mutex and tokio::sync::Mutex. Picking the wrong one could adversely affect performance.

      and

      Standard Rust threads can be “scoped”. Tokio tasks do not support it. This one is also likely never going to be fixed, since there is a fundamental issue making it impossible.

      and

      In async Rust, [closing a temp file via RAII] is not possible since tokio::fs::remove_file must be called from an async function,

      and

      Did you know Tokio will sometimes put your tasks on the heap when it believes they are too big?

      and

      Anytime a library adopts async/await, all its consumers must adopt it too. Async/await poisons across project boundaries.

      Tokio! Tokio! And more Tokio!


      Reading slightly between the lines, the author seems to have started out with the assumption that they need an N:M userspace thread runtime for good I/O performance (because of a background in C# ?), then they found Tokio (an N:M userspace thread library that uses async as one part of its implementation), and they’re having trouble getting Tokio to deliver on what the author expects of it.

      Maybe that’s the author’s fault, maybe it’s Tokio’s fault; I haven’t looked at the author’s code and therefore can’t judge either way. But it seems clear that it’s not Rust’s fault, because as the author notes the async/await model works great for (1) embedded environments and (2) multiplexing I/O operations on a single thread, which are exactly the use cases that Rust’s async/await is designed to solve.

      Maybe the answer is for the Rust project to intentionally de-emphasize Tokio in its async/await documentation? Over and over it seems that every time I try to use Tokio in my own projects I stumble into weird and non-Rustic behavior (e.g. the implicit spilling to the heap mentioned in this post), and every time I see an experienced programmer struggling with async/await the problems all seem to revolve around Tokio in some capacity.

      1. 11

        Maybe the answer is for the Rust project to intentionally de-emphasize Tokio in its async/await documentation?

        I doubt that would meaningfully help.

        From my perspective, it’s down to a “Nobody ever got fired for choosing IBM” attitude around Tokio, stemming from “I can trust that every dependency I might need supports Tokio. I don’t want to slam face-first into the hazard of some required dependency not supporting async-std or smol or what have you”.

        I think we’re just going to have to await 😜 more things like “async fn in traits” (Rust 1.75) landing as building blocks for looser coupling between dependencies and runtimes.

        (Also, ugh. Lobste.rs apparently doesn’t have an onbeforeunload handler for un-submitted posts, and I accidentally closed the tab containing the previous draft of this after getting too comfortable to bother composing it in a separate text editor and pasting it over.)

      2. 7

        Reading slightly between the lines, the author seems to have started out with the assumption that they need an N:M userspace thread runtime for good I/O performance (because of a background in C# ?), then they found Tokio

        No, it’s because the existing library ecosystem is centered around tokio. If you’re not writing things yourself, you’re probably going to be using tokio. It’s the unofficial official Rust runtime.

        1. 5

          If you’re not writing things yourself, you’re probably going to be using tokio. It’s the unofficial official Rust runtime.

          I’ve heard this a lot, but it just doesn’t seem to be true – Tokio is popular but by no means universal, and it’s silly to act as if it’s somehow more official than (for example) embassy or glommio.

          Most Rust libraries are written for synchronous operation. It’s a small minority that use async at all, and even fewer of those hardcode the async runtime to Tokio. Not to mention that the use cases for which Rust has a clear advantage over Go/Java/C#/etc are things like embedded, WebAssembly, or dynamic libraries – none of which are supported by heavy N:M runtimes such as Tokio.

      3. 5

        To me tokio is a bunch of decisions made for me. At first, when I saw it, I disagreed with most of them in some way or another. Once I couldn’t avoid it, I realized in the end this isn’t a half-bad way of providing parallelism, and, for instance, the alternative to “spilling to the heap” is essentially crashing.

        What I think is scary about tokio for new users, and what I’m planning a little post on, is that if you start going deeply async you end up unbounded, possibly with an explosion in the number of tasks depending on how your code is written. You can hit external limits, etc. Controlling that can only be done (afaict) with a tuned limit in an Arc passed from the root down to the over-spawned tasks. To me it’s a small price for how easy writing and maintaining it is.

      4. 3

        Some of the issues are Tokio specific, some are not. Either way thinking about async/await on purely a language-level is not helpful. Everyone has to pick a runtime and due to lock-in a majority will end up with tokio. Whether the issue stems from Tokio or Rust’s language design ultimately does not matter to me as a programmer that wants my code to work.

        1. 7

          But you frame the article as a general critique against async/await, saying that most of what you say should apply even to other languages!


          So how about the web servers that run the web right now? Interestingly enough, nginx is written in C, and does not use async/await. Same for Apache. Those two servers happen to be the most widely used web servers and together serve two thirds of all web traffic. Both of them do use non-blocking I/O, but they do not use async/await. They seem to do pretty well regardless. (Note that my beef is specifically with async/await, not with non-blocking I/O.)

          [..]

          If you are a hyperscaler, you are not using async/await. For a hyperscaler, the cost of managing their server infrastructure may be in the billions. Async/await is an abstraction. You’ll not want to use it.

          Are you aware that one hyperscaler, Cloudflare, replaced Nginx with an in-house proxy written in Rust using Tokio, in part due to issues with tail latencies and uneven load balancing between cores?

          1. 7

            AWS’s core orchestration loop uses Tokio as well.

            Meta’s C++ uses Folly coroutines extensively as something quite similar to async/await.

          2. 3

            Yes (link here for those interested: https://blog.cloudflare.com/how-we-built-pingora-the-proxy-that-connects-cloudflare-to-the-internet/). My reading is that the performance issues were due to nginx not reusing connections across cores, which could be solved without Tokio. Cloudflare could have opted to use mio directly for example. On the other hand I do understand their choice to use Tokio here because it probably helped them ship much faster. I would not be surprised if they eventually swap out Tokio for a custom runtime (or maybe even use mio directly) since there would probably be some extra performance to be gained.

            Note that I edited my blog post a bit to reflect the fact that hyperscalers are using async/await.

            1. 8

              I’ve looked at using mio directly. It’s much more difficult than using Tokio.

            2. 7

              My reading is that the performance issues were due to nginx not reusing connections across cores, which could be solved without Tokio.

              That’s not entirely correct. They also had difficulty with nginx’s thread-per-core model which precludes work-stealing.

              Cloudflare could have opted to use mio directly for example.

              Why would they use mio directly and implement a work stealing scheduler atop of Mio? To me, this seems like Tokio, but with extra steps.

              I would not be surprised if they eventually swap out Tokio for a custom runtime (or maybe even use mio directly) since there would probably be some extra performance to be gained.

              I would be. A work-stealing scheduler is almost certainly what most people want/need, but it is especially optimal for load balancers/proxies.

              Note that I edited my blog post a bit to reflect the fact that hyperscalers are using async/await.

              I know you edited this, but the alternative to using async/await is pervasive, feral concurrency control where people constantly spin up new thread pools. async/await, in Rust, is substantially more efficient than the alternatives that people would otherwise gravitate to.

      5. 1

        Rust async/await essentially is Tokio. I have yet to see any code in the wild, or any libraries, which use async/await but not tokio.

        1. 1

          It’s extremely easy to find async code that doesn’t use Tokio. And in any case, this article claims to be a general criticism of the whole async/await paradigm, regardless of runtime or language.

    4. 16

      Async has a bunch of implementation issues, but I’ve been converting sync programs to async and haven’t regretted it.

      Blocking threaded code looks nice and simple only in simple cases where it can just do one thing at a time, start to finish, not caring about latency, or other things happening.

      But once you can have multiple things happening at the same time, multiple threads running, you’re going to have to be able to wait for multiple things and merge multiple results. There’s inherent complexity in this, and threads don’t have the nicest APIs for this.

      With events/channels/signals/actors/callbacks, it ends up either as separate scraps of code running on separate threads, with hard to follow adhoc run-time relationships, and/or you build state machines by hand.

      When you try to make this code look more linear, and the events composable in standard ways, you end up reinventing the Promises API. And then you find you still need recursion instead of loops, and merging of events after conditionals (like compiler φ nodes written by hand!), and wouldn’t it be nice if, instead of all these nested callbacks, it just used the normal threaded-code-like syntax?

      1. 1

        But once you can have multiple things happening at the same time, multiple threads running, you’re going to have to be able to wait for multiple things and merge multiple results. There’s inherent complexity in this, and threads don’t have the nicest APIs for this.

        Not disagreeing with you, but do you think this is inherent to the abstraction of what a “thread” is? (Not being limited to OS threads) What about e.g. Java’s ‘structured concurrency’ https://openjdk.org/jeps/453 ?

        1. 5

          Yes, I think it’s inherent when you frame threads as the antidote to the promises and async/await, because that implies the threaded alternative should use something “simple” that definitely doesn’t have the same problems as async.

          When the argument against async is that it has its own “color”, with special ways of calling functions, and special ways of getting the results, it implies that the alternative doesn’t have that. When the argument is that async is difficult to reason about, because multiple things can be running concurrently, and data shared between tasks may need to be thread-safe, how are you supposed to have a multi-threaded alternative that doesn’t have that?

          When you build a composable structured concurrency abstraction, you end up having special ways of calling functions asynchronously. It’s not the plain simple blocking call() any more, it’s join_handle = scope.fork(() -> call())! It doesn’t simply return you the result like a blocking function would; it returns some Task or JoinHandle that you need to .join() or await.
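
          The same shape shows up even with plain OS threads in Rust; a minimal sketch with std::thread::scope:

            fn call() -> u32 {
                42 // placeholder for real blocking work
            }

            fn main() {
                let (a, b) = std::thread::scope(|s| {
                    // Not a plain call() any more: you get back handles you must join.
                    let h1 = s.spawn(call);
                    let h2 = s.spawn(call);
                    (h1.join().unwrap(), h2.join().unwrap())
                });
                println!("{a} {b}");
            }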

          If your structured concurrency library offers you race or select, it will want an ability to time out and abort blocking calls. Plain old threads don’t have that built in, so you need to pass around some Context. In Golang, you have functions that support the Context, and functions that don’t. A function that didn’t get a Context as an argument can’t pass it down to a function that requires a Context as an argument. It looks like a “color” to me.

          Futures/Promises are an implementation of structured concurrency. When you put them against structured concurrency libraries, they’re not opposites any more, only different flavors of the same concept, where you pick which particular implementation details you prefer.

          1. [Comment removed by author]

    5. 10

      Pin<Box<dyn Future<Output = ()> + Send + '_>> (actual production code).

      They have played us for absolute fools!

    6. 9

      I find this very interesting because the original promise of async/await is that you’ll be able to linearize the flow of your programs while still enjoying the benefits of async I/O primitives. But now, if your program flow includes CPU work anywhere, you will still have to deal with all the synchronization and non-linear control flow patterns associated with multi-threaded programming. Right back where we started.

      …though choosing a channel implementation like Flume which supports async operation can at least make it less painful by letting you plop a tokio::sync::oneshot into each of the “work order” structs you dispatch down the Flume to your worker pool as a completion semaphore which can carry a return value and then await the other end of it. (That’s how I implemented non-disk-thrashing on-demand generation of the thumbnail cache for the “miniserve, but an image gallery” project I really need to get back to and finish preparing to make public.)
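
      (Roughly the shape of that pattern, as a sketch with made-up WorkOrder fields rather than the actual project code:)

        use tokio::sync::oneshot;

        // "Work order" sent down the flume channel; it carries its own completion channel.
        struct WorkOrder {
            input: String,
            done: oneshot::Sender<Vec<u8>>,
        }

        #[tokio::main]
        async fn main() {
            let (tx, rx) = flume::unbounded::<WorkOrder>();

            // Blocking worker pool: each worker pulls orders and reports back over the oneshot.
            for _ in 0..4 {
                let rx = rx.clone();
                std::thread::spawn(move || {
                    while let Ok(order) = rx.recv() {
                        let result = order.input.into_bytes(); // stand-in for thumbnail generation
                        let _ = order.done.send(result);
                    }
                });
            }

            // Async side: dispatch an order and await its completion without blocking the runtime.
            let (done_tx, done_rx) = oneshot::channel();
            tx.send(WorkOrder { input: "cover.jpg".into(), done: done_tx }).unwrap();
            let thumbnail = done_rx.await.unwrap();
            println!("got {} bytes", thumbnail.len());
        }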

      Explicitly high-level languages such as C# have a much easier time, since they can get away with more abstractions. As Rust becomes more of a systems programming language, it becomes less suitable for async/await. The two goals simply do not align.

      I think you mean a low-level programming language. Rust has always been a systems programming language, as have C#, Java, and Go.

      Anytime a library adopts async/await, all its consumers must adopt it too. Async/await poisons across project boundaries. There are some nuances here. Technically, there are ways to deal with this. In practice though, poisoning happens all the time.

      I have to agree with this. Hell, I have a project where I managed to keep it sync… but because I’m using WASI Preview 2 and wit-bindgen to avoid having to manually write a ton of uncomfortable boilerplate for my plugin API, and wasmtime-wasi forces wasmtime’s async flag on, Tokio gets pulled into the dependency tree despite plugins not being granted any of the WASI APIs that async was used to implement, my bindings being generated async: false, my not spinning up any Tokio runtime, and Tokio (ideally) being LTOed back out.

      (Seriously. The most complex plugin I have so far reports that it wants to be called on encountering a booklinks HTML tag and that it wants to receive the contents of the isbn attribute, is implemented as a pure function with no side-effects or external I/O, embeds an ISBN parsing/validating/prettyprinting crate, and returns a Result containing a chunk of rendered HTML as a String.)

      Whether it’s keyword generics or some other solution, we need a way to make it feasible for unpaid hobby developers with no time to write two implementations to not lock their libraries into asyncness.

    7. 9

      Rust is very explicitly designed to be usable for small microcontrollers through to large distributed systems. This needs to be kept in mind when discussing whether or not async/await is better or worse for a specific subset of use cases. A feature this core must be suitable for every use case.

      I don’t really think there is any other approach that Rust could have taken that would meet all the constraints besides not doing anything at all, which for a lot of the world that does need to write asynchronous code would make the experience worse (of which a lot has been said before including in the other comments on this post).

    8. 8

      I want to start off by acknowledging that async Rust has a number of structural issues, particularly around cancellations. That being said, I will once again share my blog post where I talk about how I switched to async Rust for cargo-nextest: https://sunshowers.io/posts/nextest-and-tokio/

      I genuinely do not care about c10k. But I simply do not believe it is possible to write nextest to the quality level it’s at using purely synchronous abstractions. You would have to dip into OS-specific asynchronicity (epoll, IOCP etc) and I would much rather use a platform-independent abstraction over that — which is exactly what async Rust provides.

      Now you could say that nextest is one of the few kinds of programs that should use async Rust. But honestly I don’t buy that — even if most other programs have less complicated state machines than nextest, I think the crossover point where async Rust is justified is well before that. I would say that as soon as you’re spinning up more than 2-3 OS threads just to perform a crossbeam select, you should consider async Rust.
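
      To illustrate that crossover point, the kind of multiplexing that otherwise takes a few dedicated threads feeding a crossbeam select collapses into a single task (a sketch, assuming Tokio with the time and signal features enabled):

        use tokio::time::{interval, Duration};

        #[tokio::main]
        async fn main() {
            let (tx, mut events) = tokio::sync::mpsc::channel::<String>(64);
            let mut tick = interval(Duration::from_secs(1));

            // Stand-in for a child-process watcher or similar event source.
            tokio::spawn(async move {
                let _ = tx.send("child process exited".to_string()).await;
            });

            // One task waits on several unrelated event sources at once.
            loop {
                tokio::select! {
                    Some(event) = events.recv() => println!("event: {event}"),
                    _ = tick.tick() => println!("periodic housekeeping"),
                    _ = tokio::signal::ctrl_c() => break,
                }
            }
        }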

      1. 2

        You would have to dip into OS-specific asynchronicity (epoll, IOCP etc) and I would much rather use a platform-independent abstraction over that — which is exactly what async Rust provides.

        There are loads of wrappers around platform-specific event loop and polling mechanisms that don’t require async, though; async runtimes go much further.

        1. 5

          Do they let you select over arbitrary sources of asynchronicity in a platform-independent manner? Signalfd and pidfd are only available on Linux.

          1. 1

            What is supported varies by library, but usually files, timers, and signal handling are the minimum.

    9. 15

      The post is unfortunately very obviously written by somebody who never had a network filesystem hang up on them in their life.

      1. 4

        Async/await can help with that. But seems a bit of a stretch to present it as if it were the only possible solution.

        1. 11

          The other solutions end up looking very similar to async, but without the syntax sugar to make the code linear.

          Timeouts and cancellation are important for reliability, but unfortunately the sync APIs we have suck at that. They’re all forever-blocking and non-cancellable by default. Timeouts are ad hoc, per library, per API. You can easily forget to set one up, or call some library that didn’t, and it will get your thread stuck sooner or later.

          For cancellation at the application level you need custom solutions. So far I haven’t seen anything nicer than passing a context everywhere and regularly checking ctx.cancelled() and returning early if so.
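
          For contrast, the async versions compose around any future from the outside; a sketch using Tokio’s timeout helper and the CancellationToken from the tokio-util crate:

            use tokio::time::{timeout, Duration};
            use tokio_util::sync::CancellationToken;

            async fn read_config() -> std::io::Result<String> {
                tokio::fs::read_to_string("/mnt/nfs/config.toml").await
            }

            #[tokio::main]
            async fn main() {
                // A timeout wraps any future, even from a library that never heard of timeouts.
                match timeout(Duration::from_secs(2), read_config()).await {
                    Ok(Ok(cfg)) => println!("{} bytes", cfg.len()),
                    Ok(Err(err)) => eprintln!("io error: {err}"),
                    Err(_) => eprintln!("gave up after 2s (hung filesystem?)"),
                }

                // Cancellation works the same way: race the work against a token.
                let token = CancellationToken::new();
                tokio::select! {
                    _ = token.cancelled() => eprintln!("cancelled"),
                    res = read_config() => { let _ = res; },
                }
            }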

          1. 2

            I agree with that. That’s why I mentioned it in the “Async/await has good parts” section. I do feel like most arguments for async/await boil down to this specific point. So it is up to the programmer to decide if async/await’ing the entire codebase is worth it to get ergonomic timeouts and (somewhat) ergonomic cancellation.

            1. 5

              You don’t need to async/await the entire codebase. You only need to do so for the parts of your program that deal with IO and orchestration.

              Most of nextest is synchronous – there’s just really one file (the core runner loop) that’s async. For example, nextest’s config parsing code is completely synchronous.

    10. 4

      First of all, I agree with quite a few of the drawbacks of async Rust in this article! There are a lot of sharp edges in async Rust (if I found a magic lamp today, my 3 wishes would be for stable coroutines, async drop, and world peace)

      But personally, I feel like I get a lot of benefits out of async Rust. I usually default to async-first, even for small CLI tools or throwaway scripts. That’s partially driven by the ecosystem adoption, but I do feel like I get a lot of mileage out of async.

      Example: let’s say I need to download 1,000 files or so. What are my options for scheduling the downloads in sync Rust?

      • Spawn a thread per download. Oops, that just ate 4 GB of memory due to the per-thread stack size. (Edit: see below, this is wrong)
      • Spawn some pool of worker threads, use channels to distribute work to each one
      • Use Rayon to spawn each download to put it in its thread pool
      • Use Rayon’s .par_iter() on the iterator of things to download

      Okay, and what about async Rust? (and I’ll assume Tokio just because that’s what I’m most familiar with)

      • Spawn a Tokio task per download
      • Spawn some pool of worker tasks, use channels to distribute work to each one
      • Use a LocalSet to spawn all the downloads so they run on the same OS thread
      • Use a futures::FuturesUnordered to spawn all the downloads so they run as part of one Tokio task
      • Build a stream from the iterator of things to download
      • Use a stream with .buffered(n) so only n downloads run concurrently
      • Do any of the options from sync Rust (with careful use of spawn_blocking or channels)

      I’m sure there are more options I didn’t think of, but the point I want to hammer home is that I get a lot more options for managing concurrent work in async Rust than I do with sync Rust. In fact, it’s pretty easy to see some parallels between the sync options and async options! If you’re already content with using a pool of worker threads with channels, it’s generally pretty easy to just move to worker tasks in Tokio.

      So the key benefit to me of async Rust is that I can just tokio::spawn as much work as I have without really thinking too much about the cost (or shape the work into multiple tasks in the way I see fit), then separately configure the runtime to control how that work should be split across OS threads. In async Rust, I can manage concurrency separately from parallelism; that’s not really possible today with sync Rust, or at least not to the same level.
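
      As one concrete instance of “spawn as much work as I have, then configure the runtime separately” (a sketch; download is a stand-in for a real HTTP request):

        use tokio::task::JoinSet;

        // Stand-in for an actual download.
        async fn download(url: String) -> Result<usize, std::io::Error> {
            Ok(url.len())
        }

        // Concurrency (1,000 tasks) is decided here; parallelism (4 OS threads) is runtime config.
        #[tokio::main(flavor = "multi_thread", worker_threads = 4)]
        async fn main() {
            let urls: Vec<String> = (0..1_000).map(|i| format!("https://example.com/file-{i}")).collect();

            let mut tasks = JoinSet::new();
            for url in urls {
                tasks.spawn(download(url));
            }

            let mut total = 0;
            while let Some(res) = tasks.join_next().await {
                total += res.expect("task panicked").expect("download failed");
            }
            println!("downloaded {total} bytes");
        }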

      1. 8

        Spawn a thread per download. Oops, that just ate 4 GB of memory due to the per-thread stack size

          No, spawning 1000 threads to download stuff would not allocate 4 GB of memory. The cost of a thread is a couple of pages, so I would expect this program to run in under a hundred megs.

        1. 2

          Ah, fair (in my defense, I checked the Rust stdlib where the default stack size per thread is 4 MB, but I didn’t consider how that interacted with virtual memory pages)

          Just for my own understanding though, how would that work on a 32-bit system? Even though each thread only needs to use a few pages of memory, I’m assuming that 4 MB per thread is still reserved in the process’s address space, meaning 1000 threads at 4 MB of stack size (even unused) would basically be the upper limit? Or is that not how it works? (It’s basically a moot point nowadays of course since 64-bits of address space is unfathomably huge)

          1. 3

            Yes, on a 32-bit system you’d run out of address space! So at some point an mmap call to set up a new thread’s stack would fail, and thread::spawn would error out.

            Which is not that different from the actual behavior you’d observe on a modern OS. Spawning would start erroring out somewhere between 1k and 10k threads; it’s just that this wouldn’t have anything to do with memory in particular.

      2. 3

        I usually default to async-first, even for small CLI tools or throwaway scripts. That’s partially driven by the ecosystem adoption, but I do feel like I get a lot of mileage out of async.

        For me, small CLI tools and throwaway scripts are where I most want to avoid async, because Cargo’s incremental build caching is fragile and, even at the best of times, I don’t want a pile more dependencies slamming my build times in the face after every rustup update.

        Hell, aside from the lack of memory-safe QWidget bindings, a sufficiently advanced SQLite+PostgreSQL abstraction, and Django’s ecosystem, the build times involved in spinning up “a little script” are the main thing I think about when deciding whether to write something in Rust to plan for the inevitable “not really a throwaway after all” moment.

        (Serde, Clap or Gumdrop as appropriate, Rayon, ignore, etc. are very desirable things for little scripts and CLI tools.)

    11. 3

      I’m a bit surprised the article doesn’t mention the main issue with using threads for a high number of concurrent tasks: the memory used by each thread for its call stack.

      1. 10

        Memory usage is not the main issue with threads. Memory usage of Go’s goroutines and of threads is not that different. I think it’s like a 4x-8x difference? Which is not small, of course, but given that memory for threads is only a fraction of the memory the app uses, it’s actually not that large in absolute terms. You need a comparable amount of memory for buffers for TCP sockets and such.

        As far as I can tell, the actual practical limiting factor for threads is that modern OSes, in default configuration, just don’t allow you to have many threads. I can get to a million threads on my Linux box if I sudo tweak it (I think? Don’t remember if I got to a million actually).

        But if I don’t do OS-level tweaking, I’ll get errors around 10k threads.

        1. 3

          Not touching on the rest of the comment because it’s not something I have extensive experience with, but I do want to point out that Go isn’t really the best comparison point since goroutines are green threads with growable stacks, so their memory usage is going to be lower than native threads. Any discussion about memory usage of threads probably also needs to account for overcommit of virtual vs resident memory.

          All of this is moot in Rust due to futures being stackless, so my understanding (I reserve the right to be incorrect) is that in theory they should always use less memory than a stackful application.

          1. 4

            That is precisely my point: memory efficiency is an argument for stackless coroutines over stackful coroutines, but it is not an argument for async IO (in whichever form) over threads.

        2. 1

          Sorry for reading and replying to your comment so late. The minimum stack size of a goroutine is 2 kB. The default stack size of a thread on Linux is often 8 MB. Of course, the stack size of most goroutines will be higher. And similarly, it is usually possible to reduce the stack size of a thread to 1 MB or less if we can guarantee the program will never need more. Is that how you concluded that the difference was somewhere around 4x-8x?

          I like your point about the fact that the memory for buffers, usually allocated on the heap, should be the same regardless. Never thought of that :)

          1. 1

            The stack size of a thread can be just a few kilobytes on Linux since the pages don’t actually get mapped until accessed.

            1. 2

              I know, but I’ve always wondered what happens when a program has hundreds of thousands of threads. Will the TLB become too large, with a lot of TLB misses making the program slow? When the stack of a thread grows from 4 kB to 8 kB, how is that mapped to physical memory? Does it mean there will be 2 entries in the TLB, one mapping the first 4 kB, and another the second 4 kB? Or will the system allocate a contiguous segment of 8 kB, and copy the first 4 kB to the new memory segment? I have no idea how it works concretely. But I would expect these implementation “details” to impact performance when the number of threads is very large.

              1. 2

                I did some reading and will try to answer my own questions :)

                Q: Will the TLB become too large with a lot of TLB miss making the program slow?

                A: The TLB is a cache and has a fixed size. So no, the TLB can’t become “too large”. But if the working set of pages becomes too large for the TLB, then yes there will be cache misses, causing TLB thrashing, and making the program slow.

                Q: When the stack of a thread grows from 4 kB to 8kB, how is that mapped to physical memory?

                A: The virtual pages are mapped to physical pages on demand, page by page.

                Q: Does it mean there will 2 entries in the TLB, one mapping the first 4 kB, and another the second 4 kB?

                A: Yes. At least this is the default on Linux, as far as I understand.

                Q: Or will the system allocate a contiguous segment of 8 kB, and copy the first 4 kB to the new memory segment?

                A: No.

                Q: I would expect these implementation “details” to impact performance when the number of threads is very large.

                A: If the stacks are small (a few kB), then memory mapping and TLB thrashing should not be a problem.

          2. 1

            It’s 8 megs of virtual memory. Physically, only a couple of pages will be mapped. A program that spawns a million threads will use dozens of megs not 8 gigs of RAM.

            1. 2

              Correct, I keep forgetting about this. But assuming that each thread maps at least a 4 kB page, and that the program spawns a million threads, then it should use 1 million x 4 kB = 4 GB, and not dozens of megs? Or am I missing something?

              1. 1

                Typo, I meant to say thousand! But I guess you could flip that around and say that a dozen gigs is enough for a million threads, not just a thousand!

                1. 1

                  I like that: “a dozen gigs is enough for a million threads” :)

      2. 4

        The stack size isn’t a problem at all. Threads use virtual memory for their stacks, meaning that if the stack size is e.g. 8 MiB, that amount isn’t committed until it’s actually needed. In other words, a thread that only peaks at 1 MiB of stack space will only need 1 MiB of physical memory.

        Virtual address space in turn is plentiful. I don’t fully remember what the exact limit is on 64 bits Linux, but I believe it was somewhere around 120-something TiB. Assuming the default stack size of 8 MiB of virtual memory and a limit of 100 TiB, the maximum number of threads you can have is 13 107 200.

        The default size is usually also way too much for what most programs need, and I suspect most will be fine with a more restricted size such as 1 MiB, at which point you can now have 104 857 600 threads.

        Of course, if the amount of committed stack space suddenly spikes to e.g. 2 MiB, your thread will continue to hold on to it until it’s done. This however is also true for any sort of userspace/green threading, unless you use segmented stacks (which introduce their own challenges and problems). In other words, if you need 2 MiB of stack space then it doesn’t matter how clever you are with allocating it, you’re going to need 2 MiB of stack space.
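
        (Setting a smaller stack is a one-liner with std::thread::Builder; a sketch:)

          use std::thread;

          fn main() {
              let handles: Vec<_> = (0..1_000u64)
                  .map(|i| {
                      thread::Builder::new()
                          .stack_size(64 * 1024) // 64 KiB of address space reserved per thread
                          .spawn(move || i * 2)
                          .expect("failed to spawn thread")
                  })
                  .collect();

              let sum: u64 = handles.into_iter().map(|h| h.join().unwrap()).sum();
              println!("{sum}");
          }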

        The actual problems you’ll run into when using OS threads are:

        • An increase in context switching costs, which may hinder throughput (though this is notoriously difficult to measure)
        • Having to tune various sysctl settings (e.g. most Linux setups will have a default limit of around 32 000 threads per process, requiring a sysctl change to increase that). Some more details here
        • Different platforms behaving widely differently when having many OS threads. For example, macOS had (not sure if this is still the case) a limit of somewhere around 2000 OS threads per OS process
        • The time to spawn threads isn’t constant and tends to degrade when the number of OS threads increases. I’ve seen it go up all the way to 500 milliseconds in stress tests
        • Probably more that I can’t remember right now

        Of these, context switch costs are the worst because there’s nothing you as a user/developer can do about this, short of spawning fewer OS threads. There also doesn’t appear to be much interest in improving this (at least in the Linux world that I know of), so I doubt it will (unfortunately) improve any time soon.

        1. 4

          Of these, context switch costs are the worst

          What is the canonical resource that explains why context switch cost differs between the two? I used to believe that, but I no longer do after seeing

          https://github.com/jimblandy/context-switch

          And, specifically,

          In these runs, I’m seeing 18.19s / 26.91s ≅ 0.68 or a 30% speedup from going async. However, if I pin the threaded version to a single core, the speed advantage of async disappears:

          So, currently I think I don’t actually know the relative costs here, and I choose not to believe anyone who claims that they know, if they can’t explain this result.

          EDIT: to clarify, it very well might be that the benchmark is busted in some obvious way! But it really concerns me that I personally don’t have a mental model which fits the data here!

          1. 5

            From what I understand, there are two factors at play (I could be wrong about both, so keep that in mind):

            1. The time of a context switch is somewhere in the range of 1 to 2 microseconds
            2. With more threads running, the number of context switches may increase

            The number of context switches is something you might not be able to do much about, even with thread pinning. If you have N threads (where N is a large number) and you want to give them a fair time slice, you’re going to need a certain number of context switches to achieve that.

            This means that we’re left with reducing the context switch time. When doing any sort of userspace threading, the context switch time is usually in the order of a few hundred nanoseconds at most. For example, Inko can perform a context switch in somewhere between 500 and 800 nanoseconds, and its runtime isn’t even that well optimized.

            To put it differently, it’s not that context switching is slow, it’s that it isn’t fast enough for programs that want to use many threads.

            1. 2

              Your two comments here are some of the best things I’ve read about the topic in a while! Consider writing a blog post about this whole thing! In particular,

              With more threads running, the number of context switches may increase

              Is not something I’ve heard before, and it makes some sense to me (though I guess I still need to think about it more: with many threads, most threads should be idle (waiting for IO, not runnable)).

              1. 2

                I did write about this as part of this article about asynchronous IO, which refers to some existing work I based my comments on.

          2. 3

            I’d been waiting for someone who knew more about the kernel guts to comment, but I guess that’s not going to happen, so here goes.

            The context switching cost shouldn’t depend on the number of threads that exist, although there were one or two Linux versions in the early 2000s with a particularly bad scheduler where it did. I don’t buy that the number of context switches (per unit time) increases with the number of threads either in most cases; in a strongly IO-bound program it will depend solely on the number of blocking calls, and when CPU-bound it will be limited by the minimum scheduling interval*.

            I am not convinced about the method in the repo you linked. Blocking IO and async are almost the same thing if you force them to run sequentially. Whether this measures context switch overhead fairly is beyond my ken, but I will say that a reactor that only ever dispatches one event per loop is artificially crippled. It’s doing all its IO twice.

            Contrary to what one of the GH issues says, though, it’s probably not doing context switches. Like pretty much any syscall epoll_wait isn’t a context switch unless it has to actually wait.

            This isn’t a degenerate case for blocking IO and that’s enough to make up for a bit of context switching. I think that’s all there is to it.

            In general, though, the absolute cost of blocking IO is lower than I think almost everyone assumes. Threads that aren’t doing anything only cost memory (to the tune of a few KiB of stack) and context switches are usually a drop in the ocean compared to whatever work your program is actually doing. I think a better reason to avoid lots of threads is the loss of control over latency distribution. Although terrible tail latency with lots more threads than cores is often observed, I don’t know that I’ve ever read a particularly convincing explanation for this.

            * Although that is probably too low (i.e. frequent) by default.

      3. 4

        RAM is a lot cheaper nowadays than it was when c10k was a scaling challenge; 10,000 connections * 1 MiB stack is only ~10 GiB. Even if you want to run a million threads on consumer-grade server hardware (~ 1 TiB), the kinds of processes that have that kind of traffic (load balancers, HTTP static assets, etc) can usually run happily with stack sizes as small as 32 KiB.

        1. 4

          RAM is a lot cheaper nowadays than it was when c10k was a scaling challenge

          That just means that the bar should be higher now: c10M or maybe c100M.

          As for running OS threads with a small stack size, why should we have to tune that number when, with async/await, the compiler can produce a perfectly sized state machine for the task?

          1. 4

            That just means that the bar should be higher now: c10M or maybe c100M.

            If your use case requires handling ten million connections per process, then you should use the high-performance userspace TCP/IP stack written by the hundred skilled network engineers in your engineering division.

            Don’t try to write libraries to solve problems at any level of scale. Use a simple library to solve simple problems (10,000 connections on consumer hardware), and a complex library to solve complex problems (millions of connections on a 512-core in-house 8U load balancer).

            As for running OS threads with a small stack size, why should we have to tune that number when, with async/await, the compiler can produce a perfectly sized state machine for the task?

            Because you’ll need to tune the numbers anyway, and setting the thread stack size is a trivial tuning that lets you avoid the inherent complexity of N:M userspace thread scheduling libraries.

    12. 1

      The more I read about and use async/await, the more I like Ye Olde JavaScript callback hell. Asynchronous control flow goes right and synchronous goes down. They’re obviously different and the ergonomics aren’t that bad*.

      Every async/await is trying to make two axes of control flow look like one, and it’s confusing to both us and type systems.

      *You’re not using a VT220; make your editor wider. Also, you already scroll up and down.

      1. 2

        It’s just coroutines. Coroutines have worked and looked like that since before callbacks were even invented.

        1. 1

          I’ve never used a modern or pragmatically useful language with ergonomic first-class coroutines; are there any? I’m not sure I’ve used a dynamic one either, unless you claim “async/await is coroutines”, which I’m not sure I agree with.

          1. 1

            The async/await syntax idea that exists in many languages isn’t coroutines, but the way they work in Rust is just coroutines. When you call an async function you receive a state machine with an associated method to drive it forward. Very similar to iterators, except you need to wait for whatever async operation the async function is waiting on before driving it forward.

            1. 1

              Iterators aren’t the most ergonomic feature of Rust either, IMO.

    13. 1

      Something tells me the mess of async functions wouldn’t be so horrifying if Rust had real higher-kinded types and, especially, monads.

      1. 11

        Rust’s designers are of course aware of the idea of first-class monads. The fact that Rust doesn’t have them is indicative of something!

        One of several reasons is that complex type hierarchies (like Haskell’s Monad -> Applicative -> Functor) are hard to stabilize. Haskell is more of an academic language so it can get away with occasional API breakage, as happened when Applicative was inserted in between Monad and Functor. Rust focuses on stability as more of an industrial language.

        1. [Comment from banned user removed]

          1. 14

            Of a skill issue, perchance.

            👍

          2. 3

            Rust doesn’t focus on stability, it just hasn’t existed for long enough to face this choice yet.

            But they’ve already made the choice, given the guarantees, and have been and continue to be very conservative about additions precisely to make sure the mistakes don’t pile on too quickly. The time to focus on avoiding the mistakes C++ and Python made are precisely when a language is young, before it’s too late.

            1. [Comment from banned user removed]

      2. 7

        It’s a general rule that the Rust teams are well-versed in those sorts of things and, if you look at the mailing lists of Rust’s pre-1.0 days, you’ll find “This is a good idea. We should have it.” attitudes that ran afoul of other concerns.

        The devs wanted higher-kinded types, but had to settle for generic associated types (GATs) because languages like Haskell get away with relying on garbage collection to make it work and Rust is pushing the envelope for what can be done without garbage collection as-is.

        As for monads, see https://web.archive.org/web/20220118203747/https://twitter.com/withoutboats/status/1027702531361857536

        MONADS AND RUST A THREAD
        - or -
        why async/await instead of do notation

        1. 15

          The issue with HKT is that Rust doesn’t have currying, nothing to do with lifetimes. I wrote it in that very thread:

          The problem is that without currying at the type level, higher kinded polymorphism makes type inference trivially undecidable. We have no currying. In order to add higher kinded polymorphism, we’d have to restrict the sorts of types you could use in a way that would feel very arbitrary to users. In contrast, generic associated types don’t have this problem, and directly solve the expressiveness problems we do have, like Iterable (abstracting over everything with a .iter(&self) method)

            The problem is that solving equations with multiple parameters is not decidable (I don’t remember if it’s the same problem as semi-unification or not, but that feels right to me), so you’d have to do something like say the variable always comes before/after all the other types. I.e. the language could type functions of the form |T| Result<T, io::Error> or type functions of the form |E| Result<i32, E>, but not both. Haskell solves this by using currying, so you’re inherently restricted in that way. This didn’t seem worth pursuing instead of just GATs.
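
            (Spelled out, the GAT-based Iterable mentioned above looks roughly like this on stable Rust; a sketch, not an official trait:)

              trait Iterable {
                  type Item;
                  type Iter<'a>: Iterator<Item = &'a Self::Item>
                  where
                      Self: 'a;

                  fn iter(&self) -> Self::Iter<'_>;
              }

              impl<T> Iterable for Vec<T> {
                  type Item = T;
                  type Iter<'a> = std::slice::Iter<'a, T> where Self: 'a;

                  fn iter(&self) -> Self::Iter<'_> {
                      self.as_slice().iter()
                  }
              }

              // Generic code can now borrow-iterate any Iterable without higher-kinded types.
              fn first<I: Iterable>(items: &I) -> Option<&I::Item> {
                  items.iter().next()
              }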

          But you are right that we were fully aware of monads throughout the design process. Grandparent comment is the kind of shallow drive-by nonsense I’m used to experiencing, to say nothing of the blog post we’re all commenting on.

          1. 10

            Not directly related, but this reminds me of how a Haskell programmer I know who was learning Rust pointed out that Haskell’s result type (Either) typically puts the error type before the success type, while Rust does the opposite. This is because in Haskell you can create the equivalent of Result<T> by type currying, while in Rust it’s more natural to define a type alias Result<T, E = Error>.
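
            (i.e. something along these lines, a sketch with a made-up Error type:)

              #[derive(Debug)]
              pub struct Error;

              // Haskell partially applies the error first: `Either Error` is itself a type constructor.
              // Rust instead leans on a default type parameter:
              pub type Result<T, E = Error> = std::result::Result<T, E>;

              fn parse(input: &str) -> Result<u32> {                    // uses the default `Error`
                  input.trim().parse().map_err(|_| Error)
              }

              fn read(path: &str) -> Result<String, std::io::Error> {   // or overrides it
                  std::fs::read_to_string(path)
              }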

            1. 9

              It is directly relevant in that it means that, by convention, Haskell has decided you can only have the equivalent of |T| Result<T, Error> and not the equivalent of |E| Result<i32, E>. That’s what Either Error means in Haskell. Probably this is the better of the two options, but it’s an implication of having only single-argument functions and a necessary limitation to make higher-kinded polymorphism work with a decidable algorithm, whereas Rust has decided to use projection to a higher-kinded variable to avoid these issues.

              1. 3

                That makes sense! Thanks.

            2. 3

              Huh, I thought it was just a play on words, things went wrong / things went right, left type / right type.

              1. 2

                I think it’s a happy coincidence.

          2. 2

            I’ll trust that my memory has become too foggy then… and I can certainly see how type inference would be the issue.

            C++ doesn’t exactly lean hard on it.

            1. 8

              C++ avoids these issues by using templates instead of what I would consider a real polymorphism system, so it generates the monomorphic code before typechecking it. That comes with its own downsides, which are well known.

        2. 2

          In my experience this:

          the signature of >>= is m a -> (a -> m b) -> m b
          the signature of Future::and_then is roughly m a -> (a -> m b) -> AndThen (m a) b

          Is the issue I most often run into first when trying to do some higher type-level things in Rust (e.g. it is why tower’s services can’t be Arrows), IMO because of Rust’s culture of avoiding boxing and dynamic dispatch. For monads with some sense of laziness, each combinator tends to have its own output type, and those are abstracted by a trait (Iterator being the poster child).

          Not totally related: @withoutboats, if you ever find the will to put your reflections on the matter into a less twitter-thready form, I for one would really appreciate being able to refer to it.

        3. 1

          The devs wanted higher-kinded types, but had to settle for generic associated types (GATs) because languages like Haskell get away with relying on garbage collection to make it work and Rust is pushing the envelope for what can be done without garbage collection as-is.

          How are those connected? C++ does have HKTs and doesn’t rely on garbage collection either.

          As for monads, see https://web.archive.org/web/20220118203747/https://twitter.com/withoutboats/status/1027702531361857536

          The arguments in the thread you linked seem backwards. I would argue they are ex post facto justifications rather than the actual reasons. For example,

          Also, our functions are not a -> type constructor; they come in 3 different flavors, and many of our monads use different ones (FnOnce vs FnMut vs Fn).

          Not all functions in Haskell are of type ->, especially since the LinearTypes extension was merged. Granted, it was designed after Rust was set in stone and with the power of hindsight. But still, this seems possible.

          Okay, so Monad can’t abstract over Future, but still let’s have Monad. Problem: we don’t have higher kinded polymorphism, and probably never will.

          If you don’t have HKTs, you can’t have proper monads. Again, even C++ managed HKTs somehow.

          The problem is that without currying at the type level, higher kinded polymorphism makes type inference trivially undecidable. We have no currying.

          What? Also, how is this a problem? Haskell doesn’t even bother with type inference when you employ more advanced features like GADTs or Type Families. You have to write all types by hand.

          In order to add higher kinded polymorphism, we’d have to restrict the sorts of types you could use in a way that would feel very arbitrary to users.

          No? At least, I don’t see how this follows from the previous posts.

          A twitter thread is a terrible format for explaining the simplest nose picking techniques, let alone the intricacies of a type system in a complex programming language, so I’m sure I got some moments wrong. Still, I remain unconvinced.

          1. 9

            C++ does have HKTs

            What are the error messages like? Language features aren’t real if they don’t have good error messages.

            If it’s just token substitution like most C++ template metaprogramming, then that’s the equivalent of a macro.

            1. 2

              What are the error messages like? Language features aren’t real if they don’t have good error messages.

              If you say so. This immediately disqualifies 74.25% of C++, though, and makes it a mostly unreal language (might explain the choice of language for Unreal Engine).

              Then again, lengthy, overly verbose, and completely gibberish error messages are a feature of C++. Wasn’t there a competition of producing the longest wall of error text given the smallest possible input? If not, there should be.

              then that’s the equivalent of a macro.

              No, it’s not.

          2. 3

            How are those connected? C++ does have HKTs and doesn’t rely on garbage collection either.

            It’s been over half a decade and I’m still kicking myself for not bookmarking the blog post I’m vaguely remembering about the rationale behind “GATs, not HKTs”.

            I think it ties back to lifetimes, where C++ expects the programmer to make things work and Haskell just extends them as needed at runtime, while Rust would require an unacceptably heavy abstraction to do likewise (at minimum, something like re-introducing “worry about whether the struct is POD”) or otherwise unacceptably restrict them.

            As for the rest, I don’t want to argue on withoutboats’s behalf. I will say that, in my time lurking in RFCs, I’ve seen “We’ve already spent a lot of our complexity budget on ownership and borrowing. We don’t want leaky abstractions to burn more of it through seemingly arbitrary restrictions.” being a recurring theme over the years.

            EDIT: See here