This article, and indeed much of the discourse about the facility out in the wild, could really use some expansion on at least these two points:
The application submits several syscalls by writing their codes & arguments to a lock-free shared-memory ring buffer.
That’s not all, though, right? Unless you’re using the dubious kernel-driven polling thread, which I gather is uncommon, you also have to make a submit system call of some kind after you’ve finished appending stuff to your submit queue. Otherwise, how would the kernel know you’d done it?
The kernel reads the syscalls from this shared memory and executes them at its own pace.
I think this leaves out most of the interesting information about how it actually works in practice. Some system calls are not especially asynchronous: they perform some amount of work that isn’t asynchronously submitted to some hardware in quite the same way as, say, a disk write may be. What thread is actually performing that work? Does it occur immediately during the submit system call, and is completed by the time that call returns? Is some other thread co-opted to perform the work? Whose scheduling quantum is consumed by this?
Without concrete answers that succinctly describe this behaviour it feels impossible to see beyond the hype and get to the actual trade-offs being made.
you also have to make a submit system call of some kind after you’ve finished appending stuff to your submit queue. Otherwise, how would the kernel know you’d done it?
The atomic write that bumps the index of how many entries have been written is what makes them visible to the kernel. Would be interesting if io_uring had a mode where the kernel could gather & start submissions made available when a thread gets descheduled/preempted. Sort of like a passive SQPOLL thread.

… while writing that out, looks like io_uring already does something similar & is configurable! https://man.archlinux.org/man/io_uring_setup.2.en#IORING_SETUP_DEFER_TASKRUN
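For reference, here is a minimal liburing sketch of the default (no SQPOLL, no DEFER_TASKRUN) flow being discussed: writing the SQE and bumping the tail makes the entry visible, but the kernel only goes looking for it during the io_uring_enter(2) call hidden inside io_uring_submit(). The file path and buffer size are placeholders.

```c
/* Minimal sketch of the default (non-SQPOLL) submission flow with liburing.
 * Build with: gcc demo.c -luring  (assumes liburing is installed). */
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>

int main(void)
{
    struct io_uring ring;
    struct io_uring_cqe *cqe;
    char buf[4096];

    /* Plain setup: no SQPOLL, so the kernel only learns about new SQEs
     * when we call io_uring_enter(2) via io_uring_submit(). */
    if (io_uring_queue_init(8, &ring, 0) < 0)
        return 1;

    int fd = open("/etc/hostname", O_RDONLY);         /* placeholder path */
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);

    /* liburing publishes the SQE with an atomic tail store as part of
     * io_uring_submit(); without SQPOLL the kernel does nothing until
     * the io_uring_enter(2) call inside it. */
    io_uring_submit(&ring);

    io_uring_wait_cqe(&ring, &cqe);   /* io_uring_enter(2) again, GETEVENTS */
    printf("read returned %d\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    return 0;
}
```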
What thread is actually performing that work? Does it occur immediately during the submit system call, and is completed by the time that call returns? Is some other thread co-opted to perform the work? […] Without concrete answers that succinctly describe this behaviour it feels impossible to see beyond the hype.
It may use a kernel-side thread pool for blocking operations, but it will try to complete what it can (non-blocking) at the submission site. The idea is that the kernel knows better than the user how to dispatch these ops, e.g. for file read/write: queue the nvme op asynchronously for O_DIRECT, use POLLIN/POLLOUT for a network device, thread pool for an unknown virtio/fs stack, etc. Some of these hints are documented and can be adjusted per ring/op (rough sketch after this list):
skip non-blocking attempt at enter/submission-site (IOSQE_ASYNC).
avoid completions interrupting user threads when ready and instead batch them at usual transition sites (IORING_SETUP_COOP_TASKRUN).
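Assuming liburing, a rough sketch of where those two knobs live, one per ring and one per SQE; flag availability depends on kernel version (IORING_SETUP_COOP_TASKRUN appeared around 5.19), so treat this as illustrative.

```c
/* Illustrative only: a per-ring setup hint plus a per-op hint. */
#include <liburing.h>

int setup_ring(struct io_uring *ring)
{
    struct io_uring_params p = { 0 };

    /* Per-ring hint: batch completion work at the usual kernel/user
     * transition points instead of interrupting the task immediately. */
    p.flags = IORING_SETUP_COOP_TASKRUN;

    return io_uring_queue_init_params(64, ring, &p);
}

void queue_read(struct io_uring *ring, int fd, void *buf, unsigned len)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

    io_uring_prep_read(sqe, fd, buf, len, 0);

    /* Per-op hint: skip the inline non-blocking attempt at submission
     * time and go straight to the async (io-wq) path. */
    io_uring_sqe_set_flags(sqe, IOSQE_ASYNC);
}
```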
But I’m still curious: outside of IORING_SETUP_SQ_AFF and IORING_SETUP_SINGLE_ISSUER, what is made possible by knowing which threads perform the submitted work (from userspace perspective)?
what is made possible by knowing which threads perform the submitted work (from userspace perspective)?
It’s an important consideration from a broad system architecture perspective. In heavily utilised systems, it can be very important to make sure the work that a process induces is billed back to that process from a scheduling perspective. If you make a system call that does hot CPU work for 100 milliseconds, it’s pretty obvious that should come out of your scheduling quantum. If other processes are ready to run it may then be their turn.
If the work is being done in god knows what thread somewhere not easily associated with the process that asked for the work, that’s a pretty big problem from a fairness and scheduling perspective. If the work is being done during transitions (e.g., you happen to be returning from a system call but we had this other async work to do so we did that first instead) it seems like that could have a pronounced but variable impact on the latency of some operations.
Yes, these are sort of implementation details, but they’re also things you really need to be able to know about if you’re going to actually operate a heavy-use production system that uses these mechanisms.
Assuming process ~ thread ~ task, why associate a system call’s runtime duration with the caller? Time slices are usually for the task itself rather than the work it schedules. Is the worry that the 100ms spent in the kernel isn’t preemptible by other userspace tasks? I’m trying to see the connection between submitted work and fairness. Similarly with differentiating “work done during transitions” from “the scheduler decided to preempt me right after the syscall returns to userspace”.
Because duration, in this case, is one of many resources that the OS is sharing between processes. Each second of wall time can only be split so many ways, regardless of whether the CPU work occurs in user or supervisor mode. It’s totally normal for supervisor CPU time used as part of a system call to be billed to the calling process as part of its time slice, otherwise processes could easily escape their resource limit and create noisy neighbour problems for other processes sharing the system.
Similarly in the case of work done during transitions, it’s not clearly the same as preemption. If your timeslice expires or a higher priority process replaces yours, that’s fine. If your time slice doesn’t expire per se, but some system calls end up taking a lot longer than usual because hidden work is now being done during your unrelated call, I feel like that creates a huge potential source of jitter and surprise latency which makes it hard to reason about system performance.
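To make the billing point concrete, a trivial, non-io_uring illustration: CPU time the kernel spends servicing your system calls is charged to your process as system time, which getrusage(2) reports alongside user time. The open question upthread is where the equivalent time shows up once io_uring’s kernel-side workers are doing the work instead.

```c
/* Syscall-heavy loop: the copying happens in kernel mode, but it is
 * still billed to this process and shows up in ru_stime. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/resource.h>
#include <unistd.h>

int main(void)
{
    char buf[1 << 16];
    struct rusage ru;
    int fd = open("/dev/zero", O_RDONLY);

    for (int i = 0; i < 100000; i++)
        read(fd, buf, sizeof(buf));

    getrusage(RUSAGE_SELF, &ru);
    printf("user   %ld.%06lds\n", (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec);
    printf("system %ld.%06lds\n", (long)ru.ru_stime.tv_sec, (long)ru.ru_stime.tv_usec);
    return 0;
}
```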
You might want to avoid io_uring if […] in particular you want to use features with a good security track record.
I admit to being somewhat surprised to read this. Can you elaborate on this? Are you saying io_uring doesn’t have a good security track record because of its general newness, so is like other novel features that just haven’t had their time in the sun being battle-hardened, or have there been flaws or exploits already that should give us pause?
io_uring doesn’t have a good security track record because of its general newness, … or have there been flaws or exploits already that should give us pause?

Previously on Lobsters… https://lobste.rs/s/wh2oze/put_io_uring_on_it_exploiting_linux_kernel

Both, I guess.

Not much, what’s io_uring with you?
Missing from any io_uring explanation I’ve read is the reason for the “u” in the name. I know IO stands for input/output. Ring refers to the ring buffer. But what’s “uring”? An “unsigned” ring buffer? A reference to Alan Turing?

U stands for user space. The ring is shared memory — there’s a literal array of bytes in your address space which you can directly poke.
Thanks. With the term “user space” to search for, I found a source that might corroborate that etymology, though it’s ambiguously worded. From the io_uring man page:
io_uring gets its name from ring buffers which are shared between user space and kernel space.
This is actually a really good question. https://kernel.dk/io_uring.pdf

I’m guessing the “u” means “io_user-ring”, since this is a composition of two different ring buffers that carries user data:
There are two fundamental operations associated with an async interface: the act of submitting a request, and the event that is associated with the completion of said request. For submitting IO, the application is the producer and the kernel is the consumer. The opposite is true for completions - here the kernel produces completion events and the application consumes them. Hence, we need a pair of rings to provide an effective communication channel between an application and the kernel. That pair of rings is at the core of the new interface, io_uring. They are suitably named submission queue (SQ), and completion queue (CQ), and form the foundation of the new interface.
…
The io_uring name should be recognizable by now, and the _cqe postfix refers to a Completion Queue Event. For the rest of this article, commonly referred to as just a cqe. The cqe contains a user_data field. This field is carried from the initial request submission, and can contain any information that the application needs to identify said request.
Note the user_data field that is carried across the two ring buffers.
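A small liburing sketch of that user_data round trip (the struct and the nop opcode are just placeholders): whatever you attach to the SQE on the submission ring comes back on the matching CQE on the completion ring.

```c
/* user_data flows from the submission queue entry to the completion
 * queue entry; the kernel treats it as an opaque 64-bit value. */
#include <liburing.h>
#include <stdio.h>

struct my_request { int id; };       /* whatever the app uses as a tag */

void submit_tagged_nop(struct io_uring *ring, struct my_request *req)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_nop(sqe);          /* any opcode works the same way */
    io_uring_sqe_set_data(sqe, req); /* stored in sqe->user_data */
}

void reap_one(struct io_uring *ring)
{
    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(ring, &cqe);
    /* The same pointer comes back out of cqe->user_data. */
    struct my_request *req = io_uring_cqe_get_data(cqe);
    printf("completed request %d, res=%d\n", req->id, cqe->res);
    io_uring_cqe_seen(ring, cqe);
}
```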
It could also mean “unified”? I’m trolling the original patchsets and it is never mentioned why it’s called io_uring: https://lwn.net/Articles/776230/

There is a reference to changing the name to turing, so maybe you are closer? https://lwn.net/Articles/777066/

Good article.
in particular you want to use features with a good security track record.
On one hand, there’s some evidence (e.g. [1]), and it makes logical sense, that relying on well-worn paths in the kernel improves security. There’s also the difficulty of extending mechanisms like seccomp to io_uring, which makes controls harder.
On the other hand, the vast majority of server systems are single-process, single-user anyway (or a single process plus a handful of trusted sidecars), often multi-threaded with shared memory inside the process. So io_uring isn’t really changing your threat model, because everything of interest is there in the process anyway. With good modern system design, becoming root doesn’t gain anything real that you don’t get from getting RCE inside the server process anyway. Or, at least, that’s how it should be for server-side things. On the client side, and in embedded, things are different. But if you’re building a server and depending on unix permissions as an isolation boundary, there’s like a 99% chance you’re doing something fundamentally wrong that not choosing io_uring isn’t going to fix.

[1] https://www.usenix.org/conference/atc17/technical-sessions/presentation/li-yiwen
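On the seccomp point above: seccomp filters system calls, so it sees io_uring_enter(2) but not the individual opcodes submitted through the ring. What io_uring offers instead is a narrower, opt-in mechanism: per-ring opcode restrictions registered while the ring is still disabled. A sketch, assuming liburing and a 5.10+ kernel:

```c
/* Create a ring that only accepts READ and WRITE SQEs. The ring starts
 * disabled (IORING_SETUP_R_DISABLED), restrictions are registered, then
 * the ring is enabled. Illustrative only. */
#include <liburing.h>

int setup_restricted_ring(struct io_uring *ring)
{
    struct io_uring_params p = { .flags = IORING_SETUP_R_DISABLED };
    struct io_uring_restriction res[] = {
        { .opcode = IORING_RESTRICTION_SQE_OP, .sqe_op = IORING_OP_READ  },
        { .opcode = IORING_RESTRICTION_SQE_OP, .sqe_op = IORING_OP_WRITE },
    };
    int ret;

    ret = io_uring_queue_init_params(32, ring, &p);
    if (ret < 0)
        return ret;

    ret = io_uring_register_restrictions(ring, res, 2);
    if (ret < 0)
        return ret;

    /* Lift IORING_SETUP_R_DISABLED; the allow-list is now enforced. */
    return io_uring_enable_rings(ring);
}
```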
Not mentioned is the latency vs throughput tradeoff. By batching a lot of calls together, io_uring optimizes for the latter at the expense of the former. That may make sense, especially in cases where syscall overhead is substantial, but it’s not by any means universal.
io_uring is not only more performant and a better design overall than epoll and friends, it’s actually easier to use, and you don’t ever have to worry about whether an fd was opened with O_NONBLOCK or not.