Yes this is a great paper. It’s a perspective from kernel implementers, and the summary is that fork() is only good for shells these days, not other apps like servers :)
It reminds me that I bookmarked a comment from the creator of Ninja about using posix_spawn() on OS X since 2016 for performance:
I think basically Ninja can do it because it only spawns simple processes like the compiler.
It doesn’t do anything between fork() and exec(), which is how a shell sets up pipelines, redirects, and does job control (just implemented by Melvin in Oil, with setpgid() etc. )
Also, anything like OCI / Docker / systemd-nspawn does a bunch of stuff between fork() and exec()
It would be cool if someone writes a demo of that functionality without fork(), makes a blog post, etc.
As a way to help alternative kernel implementers, I’d be open to making https://www.oilshell.org optionally use some different APIs, but I’m not sure of the details right now
It doesn’t do anything between fork() and exec(), which is how a shell sets up pipelines, redirects, and does job control (just implemented by Melvin in Oil, with setpgid() etc. )
Most of those things can be done with posix_spawn_file_actions_t.
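posix_spawn_file_actions_t a, b;
int p[6]; pid_t c[2];
pipe(p), pipe(p+2), pipe(p+4);
posix_spawn_file_actions_init(&a);
posix_spawn_file_actions_init(&b);
posix_spawn_file_actions_adddup2(&a, p[0], 0); // a reads its stdin from the first pipe
posix_spawn_file_actions_adddup2(&a, p[3], 1); // a writes its stdout into the second pipe
posix_spawn_file_actions_adddup2(&b, p[2], 0); // b reads its stdin from the second pipe
posix_spawn_file_actions_adddup2(&b, p[5], 1); // b writes its stdout into the third pipe
posix_spawn(c, "a", &a, NULL, ...);   // argv and envp elided
posix_spawn(c+1, "b", &b, NULL, ...); // argv and envp elided
// Write to p[1] to feed the head of the pipeline (a); read the end of the pipeline (b) from p[4].
// Wait for c[0] (a) and c[1] (b) to get the status codes.
// Unused pipe ends still need closing (addclose in the children, close in the parent), or readers never see EOF.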
You can get setpgid with POSIX_SPAWN_SETPGROUP, and the signal mask with posix_spawnattr_setsigmask.
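To make the attribute side concrete, here is a minimal sketch in C, assuming a POSIX system with <spawn.h>. The helper name spawn_in_new_group is made up for illustration; only the posix_spawnattr_* calls and the two flags come from the standard API.

```c
#include <signal.h>
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Hypothetical helper: spawn argv[0] (looked up on PATH) as the leader of a
 * new process group, with an empty signal mask.
 * Returns the child's pid, or -1 on error. */
pid_t spawn_in_new_group(char *const argv[])
{
    posix_spawnattr_t attr;
    sigset_t empty;
    pid_t pid = -1;

    if (posix_spawnattr_init(&attr) != 0)
        return -1;

    sigemptyset(&empty);
    /* A pgroup of 0 means "use the child's own pid", like setpgid(0, 0). */
    posix_spawnattr_setpgroup(&attr, 0);
    posix_spawnattr_setsigmask(&attr, &empty);
    /* The attributes only take effect if the matching flags are set. */
    posix_spawnattr_setflags(&attr,
                             POSIX_SPAWN_SETPGROUP | POSIX_SPAWN_SETSIGMASK);

    if (posix_spawnp(&pid, argv[0], NULL, &attr, argv, environ) != 0)
        pid = -1;

    posix_spawnattr_destroy(&attr);
    return pid;
}
```

The caller still has to waitpid() on the returned pid. Note that nothing here hands the terminal to the new process group, which is the tcsetpgrp gap discussed elsewhere in the thread.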
I guess this kind of thing is what I’m looking for
4/ Linux does posix_spawn in user space, on top of fork - but sometimes vfork (an even faster, more dangerous fork). The vfork usage is what can make posix_spawn faster on Linux.
The “why” was largely political, IMO: The spawn() family of functions were introduced into POSIX so that Windows NT could get a POSIX certification and win government contracts, since it had spawn() and not fork().
Tanenbaum’s Modern Operating Systems was written around this time, and you might find its discussion of process-spawning APIs interesting: He doesn’t mention performance, and indeed Linux’s fork+exec was faster than NT’s CreateProcess, so I find it incredibly unlikely that NT’s omission of fork() was for performance; more likely it was to simplify other parts of the NT design.
I guess this kind of thing is what I’m looking for
The suggestion to run a subprocess that calls tcsetpgrp before exec isn’t a bad one, and maybe obviates some of the performance benefits you get from posix_spawn, but it might not be so bad because that subprocess can be a real simple tiny static binary that does what it needs to and calls exec(). One day maybe we won’t have to worry about this.
Another option is to just wait+WIFSTOPPED and then kill+SIGCONT it if it’s supposed to be in the foreground.
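A rough sketch of that wait+WIFSTOPPED approach in C. The function name wait_resuming and the resume counter are invented for illustration. A real shell would do this only for foreground jobs and would hand over the terminal with tcsetpgrp() before sending SIGCONT, which this sketch leaves out.

```c
#include <signal.h>
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Hypothetical helper: wait for a spawned child. Every time it is reported
 * as stopped, send SIGCONT and keep waiting. *resumed counts the resumes.
 * Returns the exit status, or -1 on error or death by signal.
 * Caveat: a child that keeps stopping would make this loop forever. */
int wait_resuming(pid_t pid, int *resumed)
{
    int status;

    *resumed = 0;
    for (;;) {
        /* WUNTRACED makes waitpid also report stopped children. */
        if (waitpid(pid, &status, WUNTRACED) != pid)
            return -1;
        if (WIFSTOPPED(status)) {
            /* A shell with job control would signal the whole process
             * group here, i.e. kill(-pgid, SIGCONT). */
            kill(pid, SIGCONT);
            ++*resumed;
            continue;
        }
        return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
    }
}
```

For example, a child started as sh -c 'kill -STOP $$; exit 7' stops itself once, gets resumed by this loop, and then exits with status 7.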
Very strange claim. AFAIK it was possible to implement fork on top of Windows NT native API right from the beginning. (I can try to find links.) And early Windows POSIX subsystems (including Interix) actually implemented fork. (This was long before WSL happened.) And Interix actually directly implemented fork on top of Windows NT native API, as opposed to very hacky Cygwin’s fork implementation.
Also IIRC the very first Windows POSIX subsystem happened before posix_spawn was added to POSIX. (Windows had a lot of different official POSIX subsystems authored by Microsoft, WSL is the last one.)
AFAIK it was possible to implement fork on top of Windows NT native API right from the beginning.
I think you’re thinking of ZwCreateSection, but I don’t think this was a win32 API call (or even well-documented), and it takes a careful reading to see how fork could be implemented with it, so I don’t think this is the same as having fork() – after all, there’s got to be lots of ways to get a fork-like API out of other things, including:
as opposed to very hacky Cygwin’s fork implementation.
I remember they claimed they had reasons for not using ZwCreateSection, but I don’t know enough to know what problems they ran into.
He doesn’t mention performance, and indeed Linux’s fork+exec was faster than NT’s CreateProcess, so I find it incredibly unlikely that NT’s omission of fork() was for performance; more likely it was to simplify other parts of the NT design.
It’s not quite that clear cut. Modern versions of NT have a thing called a picoprocess, which was originally added for Drawbridge but is now used for WSL. These are basically processes that start (almost) empty. Creating a new process with CreateProcess[Ex] creates a new process very quickly, but then maps a huge amount of stuff into it. This is the equivalent of execve + ld-linux.so running and that’s what takes almost all of the time.
Even on Linux, vfork instead of fork is faster (especially on pre-Milan x86, where fork needs to IPI other cores for TLB synchronisation).
XNU has a lot of extensions to POSIX spawn that make it almost usable. Unfortunately, they’re not implemented anywhere else. The biggest problem with the API is that they constrained it to permit userspace implementations. As such, it is strictly less expressive than vfork + execve. That said, vfork isn’t actually that bad an API. It would be even better if the execve simply returned and vfork didn’t return twice. Then the sequence would be simply vfork, setup, execve, cleanup.
With a language like C++ that supports RAII, you can avoid the footguns of vfork by doing
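pid_t pid = vfork();
if (pid == 0)
{
    {
        // Set up the child. Objects created in this inner scope are
        // destroyed (RAII) before execve runs.
    }
    execve(…);
    // Only reached if execve fails.
    pid = -1;
}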
This ensures that anything that you created in between the setup is cleaned up. I generally use a std::vector for the execve arguments. This must be declared in the enclosing scope, so it’s cleaned up in the parent. It’s pretty easy to wrap this in a function that takes a lambda and passes it a reference to the argv and envp vectors, and executes it before the execve. This ensures that you get the memory management right. As a caller, you just pass a lambda that does any file descriptor opening and so on. The wrapper that I use also takes a vector of file descriptors to inherit, so you can open some files before spawning the child, then do arbitrary additional setup in the child context (including things like entering capability mode or attaching to a jail).
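For readers without C++ at hand, the same wrapper shape can be sketched in plain C with a callback instead of a lambda. This is not the wrapper described above: vfork_exec, child_setup_fn and stdout_to_fd are names invented here. POSIX formally leaves it undefined for a vfork child to call anything except _exit and the exec family, so this leans on the behaviour of common implementations such as Linux.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical callback type: runs in the child before execve.
 * Returns 0 on success, non-zero to abort the spawn. */
typedef int (*child_setup_fn)(void *ctx);

/* Example setup callback: make the fd pointed to by ctx the child's stdout. */
int stdout_to_fd(void *ctx)
{
    return dup2(*(int *)ctx, STDOUT_FILENO) < 0 ? -1 : 0;
}

/* Hypothetical wrapper: vfork, run the setup callback in the child's
 * context, then execve. Returns the child's pid, or -1 if vfork failed. */
pid_t vfork_exec(const char *path, char *const argv[], char *const envp[],
                 child_setup_fn setup, void *ctx)
{
    pid_t pid = vfork();
    if (pid == 0) {
        /* Child: borrows the parent's memory until execve or _exit, so it
         * must not return from this function or call exit(). */
        if (setup == NULL || setup(ctx) == 0)
            execve(path, argv, envp);
        _exit(127); /* setup or execve failed */
    }
    return pid;
}
```

Because the descriptor table is not shared by vfork, the dup2 in stdout_to_fd changes only the child. Anything the callback writes to memory, however, is visible to the parent once it resumes.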
I don’t understand how any of those use cases break. In between vfork and execve, you are running in the child’s context, just as you are now. You can drop privileges, open / close / dup file descriptors, and so on. The only difference is that you wouldn’t have the behaviour where execve effectively longjmps back to the vfork, you’d just return to running in the parent’s context after you started the child.
At the point of vfork, the kernel creates two processes using the same page tables. One is suspended (parent), and the other isn’t (child).
The child continues running until execve. At that point the child process image is replaced with the executable loaded by exec. The parent process is then resumed but with the register/stack state of the child.
Exactly. The vfork call just switches out the kernel data structures associated with the running thread but leaves everything else in place, the execve would switch back and doesn’t do any of the saving and restoring of register state that makes vfork a bit exciting. The only major change would be that execve would return to the parent process’ kernel state even in case of failure.
Actually that tweet thread by the author of fish is very good.
This is exactly what Melvin just added, so it looks like we can’t use it? On what platforms?
8/ What’s missing? One hole is the syscall you’ve never heard of: tcsetpgrp(), which hands-off tty ownership. The “correct” usage with fork is a benign race, where both the parent (tty donator) and child (tty inheritor) request that the child own the tty.
9/ There is no “correct” tcsetpgrp usage with posix_spawn: no way to coax the child into claiming the tty. This means, when job control is on, you may start a process which immediately stops (SIGTTIN) or otherwise. Here’s ksh getting busted: https://mail-archive.com/ast-developers
The way I’m reading this is that ksh has a bug due to using posix_spawn() and lack of tcsetpgrp(). And they are suggesting that the program being CALLED by the shell can apply a workaround, not the shell itself!
This seems very undesirable.
I think we could use posix_spawn() when job control is off, but not when it’s on.
And I actually wonder if this is the best way to port to Windows? Not sure
Linux does posix_spawn in user space, on top of fork - but sometimes vfork
Glibc and musl add more and more optimizations over time, allowing posix_spawn to use vfork (as opposed to fork) in more and more cases. It is quite possible that recent versions of glibc and musl call vfork in all cases.
I would like to do some container-like stuff without Docker, e.g. I’ve looked at bubblewrap, crun, runc, etc. a little bit
We have a bunch of containers and I want to gradually migrate away from them
I also wonder if on Linux at least a shell should distribute a bubblewrap-like tool ! Although I think security takes a long time to get right on Linux, so we probably don’t want that responsibility
It is public domain. I’m sorry for Russian comments.
Compile with gcc -o asjail asjail.c. My program is x86 Linux specific. Run like so: asjail -p bash.
Program has options -imnpuU, which correspond to unshare(1) options (see its manpage). (Also, actual unshare has option -r, I have no such cool option.)
My program usually requires root privileges, but you can specify -U flag, which creates user namespace. So, you can run asjail -pU bash as normal user, this will create new user namespace and then create PID namespace inside it. (Again: unshare -pr bash is even better.)
But user namespace requires that they should be enabled in kernel. In some distros they are enabled by default, in others - not.
I wrote this util nearly 10 years ago. Back then I wanted some lightweight container solution. I was not aware of unshare(1) util. unshare(1) fully subsumes my util. (Don’t confuse with unshare(2) syscall.) Also, unshare(1) is low-level util, it’s lower than bubblewrap, runc, etc.
I don’t remember some details in this code, for example I don’t remember why I need this signal mask manipulations.
Today I use docker and I’m happy with it. Not only does docker provide isolation, it also allows you to write Dockerfiles. Partial results in Dockerfiles are cached, so you can edit a line in a Dockerfile, and docker will rebuild exactly what is needed and no more. Dockerfiles are perfect for bug reports, i.e. you can simply send a Dockerfile instead of a “steps to reproduce” section. The only problem with docker is the inability to run systemd inside it. I have read that this is solved by podman, but I didn’t test it. Also, Dockerfiles are not quite reproducible, because they often rely on downloading something from the internet. I have read that the proper solution is Nix, but I didn’t test it.
You need to add -- to make sure options are processed as part of target command, not asjail itself, i. e. asjail -p -- bash -c 'echo ok'.
asjail was written by careful reading of clone(2) manual page.
asjail is not complete solution. It doesn’t do chrooting, mounting needed directories (/proc, /dev, etc). So back then 10 years ago I wrote bash script called asjail-max-1, which does these additional steps. asjail-max-1 was written by careful reading of http://www.freedesktop.org/wiki/Software/systemd/ContainerInterface and so asjail-max-1 (together with asjail) can run systemd in container! asjail-max-1 is screen sized bash script.
But, unfortunately, to make all this work, you also need C program, which emulates terminal using forkpty(3). So I wrote such a program and I called it pty. It is some screens sized C program. Together asjail, asjail-max-1 and pty gives you complete small solution for running containers. In something like 5 screens of code.
I can post all this code.
But all this is not needed, because all this is subsumed by existing tools. asjail is subsumed by unshare. And asjail-max-1+asjail+pty is subsumed by systemd-nspawn.
Today I don’t use any of these my tools. When I need to run container I use docker. If I need to run existing file tree I use systemd-nspawn
Also, all these tools I discussed so far are designed for maximal isolation. I. e. to prevent container from accessing host resources. But sometimes you have opposite task, i. e. you want to run some program and give it access to host resources, for example to X server connection, to sound card etc. So I have another script, which is simple wrapper for chroot, which does exactly this: https://paste.gg/p/anonymous/84c3685e200347299cac0dbd23d31bf3
I want to write my own shell some day. And I don’t want to use fork there, I will use posix_spawn instead. But this will be tricky. How to launch subshell? Using /proc/self/exe? What if /proc is not mounted? I will possibly try to send a patch to Linux kernel to make /proc/self/exe available even if /proc is not mounted. Busybox docs contain (or contained in the past) such patch, so theoretically I can simply copy it from there. I can try to find a link
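The /proc/self/exe idea can be sketched in a few lines of C. This assumes Linux with /proc mounted, which is exactly the limitation raised above. The name respawn_self is hypothetical.

```c
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Hypothetical helper: start another copy of the currently running
 * executable with the given argv, without calling fork() directly.
 * Returns the child's pid, or -1 on error (for example, /proc not mounted). */
pid_t respawn_self(char *const argv[])
{
    pid_t pid;

    /* The path is resolved in the new process just before it execs, where
     * /proc/self/exe still names the same binary as in the parent. */
    if (posix_spawn(&pid, "/proc/self/exe", NULL, NULL, argv, environ) != 0)
        return -1;
    return pid;
}
```

A shell using this for subshells would also have to pass its in-memory state (variables, functions, options) to the new copy explicitly, since nothing is inherited the way it is with fork.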
Also, anything like OCI / Docker / systemd-nspawn does a bunch of stuff between fork() and exec()
Surprisingly, we can combine posix_spawn speed with the needs of docker and systemd-nspawn. Let me tell you how.
First of all, let’s notice that merely fork is not enough for systemd-nspawn. You also need to put the child in a new mount/UTS/etc. namespace. For this you need clone or unshare(2).
Fortunately, clone has the flag CLONE_VFORK, which allows us to get vfork-like behavior, i.e. our program will be faster than with fork.
So, to summarize, here are the options for combining posix_spawn speed with the features systemd-nspawn needs:
Use clone with all the namespacing flags we need (such as CLONE_NEWNS) and CLONE_VFORK
Create a new process using the usual posix_spawn and then call unshare to put the process into a new namespace
I haven’t tested any of this, so it is possible something will go wrong.
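The clone variant could look roughly like this in C, using the glibc clone() wrapper on Linux. The names spawn_with_clone and child_main are invented, and the 64 KiB child stack is an arbitrary choice. Namespace flags such as CLONE_NEWNS normally need root, or CLONE_NEWUSER where user namespaces are enabled.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Runs in the child: exec the command, or exit 127 if that fails. */
static int child_main(void *arg)
{
    char **argv = arg;
    execvp(argv[0], argv);
    _exit(127);
}

/* Hypothetical helper: start argv in the namespaces selected by ns_flags
 * (e.g. CLONE_NEWNS | CLONE_NEWUTS; 0 for none), with the parent suspended
 * until the child execs or exits, as with vfork.
 * Returns the child's pid, or -1 on error. */
pid_t spawn_with_clone(int ns_flags, char **argv)
{
    const size_t stack_size = 64 * 1024;
    char *stack = malloc(stack_size);
    pid_t pid;

    if (stack == NULL)
        return -1;
    /* The stack grows down on common architectures, so pass its top.
     * SIGCHLD as the exit signal lets the parent use plain waitpid(). */
    pid = clone(child_main, stack + stack_size,
                CLONE_VFORK | ns_flags | SIGCHLD, argv);
    /* Without CLONE_VM the child works on its own copy of this memory,
     * so the parent can free its copy as soon as clone() returns. */
    free(stack);
    return pid;
}
```

Adding CLONE_VM would give true vfork-style memory sharing, but then the stack must stay allocated until the child has exec'd, and the usual vfork caveats apply.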
Also: I’m author of my own unshare(1) util analog (don’t confuse with unshare(2) syscall). My util doesn’t do any speed up tricks, I just use plain clone without CLONE_VFORK
Also, I wrote a Rust program, which spawns program using posix_spawn, redirects its in, out and err using posix_spawn_file_actions_adddup2, collects its out and err, waits for finish and reports status
I was reminded of this by the discussion about problems with unix process APIs. In general I like the idea of a capability-oriented process API, which is basically what pidfd is. Then if process-manipulating syscalls require a process capability, operating on other processes is just like operating on your own, which also solves the issue of awkwardly huge spawn APIs, as in section 6 “Replacing fork - Low-level: Cross-process operations.”
I think there’s a contradiction in this paper, tho: I think in principle if it’s possible to explicitly donate a share of parts of the current process’s state to a new empty process, then it’s possible to implement fork in userland, which is basically what that subsection of section 6 says. But that contradicts section 5 “Implementing fork - Fork infects an entire system” which claims “an abstraction, fork fails to compose: unless every layer supports fork, it cannot be used.”
I don’t think there’s a contradiction. A userland fork would also not compose with other things, but having a feature in one library that doesn’t compose with features in other libraries is not a new problem: it’s inherent to the UNIX shared library model (and a regression from MULTICS).
operating on other processes is just like operating on your own, which also solves the issue of awkwardly huge spawn APIs, as in section 6 “Replacing fork - Low-level: Cross-process operations.”
If you like this article (“fork in the road”), you will like io_uring_spawn, too. I think io_uring_spawn solves all fork problems, while being faster than all other solutions, even faster than vfork and posix_spawn. Slides above include table with time comparison.
Also, “fork in the road” includes this statement:
…clean-slate designs [e.g., 40, 43] have demonstrated an alternative model where system calls that modify per-process state are not constrained to merely the current process, but rather can manipulate any process to which the caller has access. This yields the flexibility and orthogonality of the fork/exec model, without most of its drawbacks: a new process starts as an empty address space, and an advanced user may manipulate it in a piecemeal fashion, populating its address-space and kernel context prior to execution, without needing to clone the parent nor run code in the context of the child. ExOS [43] implemented fork in user-mode atop such a primitive. Retrofitting cross-process APIs into Unix seems at first glance challenging, but may also be productive for future research.
The article mentions (IIRC) that fork interacts badly with sysctl vm.overcommit_memory=2 on Linux. This is very bad, because sysctl vm.overcommit_memory=2 is what you (from perfectionist view) should do always (unfortunately, this breaks some software in real world). You can read about overcommit here: https://ewontfix.com/3/
This paper reeks of sour grapes. I think fork() is great, but maybe some other metaphors would be great as well.
Once upon a time, I was working on a DNS server, and after finding it a bit slow, did a:
fork(),fork();
after binding the sockets and immediately got it running multicore. Yes, I could have rewritten things to allow main() to accept the socket to share, or to have a server process pass out threads, and then I could have used spawn, but I have a hard time being mad at those 14 bytes.
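For anyone wondering what those 14 bytes do: each fork doubles the number of processes, so two of them leave four processes all holding whatever descriptors were open beforehand. A small C sketch that counts them, with a pipe standing in for the bound sockets (the function name is made up):

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo: open a descriptor, run the two forks from the anecdote,
 * and have every resulting process write one byte to the shared pipe.
 * Returns how many processes reported in, or -1 on error. */
int count_processes_after_double_fork(void)
{
    int fds[2];
    pid_t original = getpid();
    ssize_t written;
    int count = 0;
    char c;

    if (pipe(fds) != 0)
        return -1;

    fork(), fork(); /* 1 process -> 2 -> 4, all sharing fds[] */

    written = write(fds[1], "x", 1);
    (void)written;
    if (getpid() != original)
        _exit(0); /* the three new processes are done */

    close(fds[1]);
    /* EOF arrives once every process has closed its write end. */
    while (read(fds[0], &c, 1) == 1)
        ++count;
    close(fds[0]);
    while (wait(NULL) > 0) {
        /* reap the direct children */
    }
    return count;
}
```

In the DNS server the shared descriptor was the bound socket, so the kernel spreads incoming packets across the four processes with no further code.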
Yes this is a great paper. It’s a perspective from kernel implementers, and the summary is that fork() is only good for shells these days, not other apps like servers :)
It reminds me that I bookmarked a comment from the creator of Ninja about using posix_spawn() on OS X since 2016 for performance:
https://news.ycombinator.com/item?id=30502392
I think basically Ninja can do it because it only spawns simple processes like the compiler.
It doesn’t do anything between fork() and exec(), which is how a shell sets up pipelines, redirects, and does job control (just implemented by Melvin in Oil, with setpgid() etc. )
Also, anything like OCI / Docker / systemd-nspawn does a bunch of stuff between fork() and exec()
It would be cool if someone writes a demo of that functionality without fork(), makes a blog post, etc.
Related from the Hacker News thread:
As a way to help alternative kernel implementers, I’d be open to making https://www.oilshell.org optionally use some different APIs, but I’m not sure of the details right now
Most of those things can be done with posix_spawn_file_actions_t.
You can get setpgid with POSIX_SPAWN_SETPGROUP, and the signal mask with posix_spawnattr_setsigmask.
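As a fuller illustration of the file-actions approach, here is a sketch in C that runs a two-command pipeline and captures its output without calling fork() directly. The function name run_pipeline and its parameters are hypothetical; the posix_spawn_file_actions_* calls are the standard ones. It assumes cap is at least 1 and that descriptors 0 to 2 are already open.

```c
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Hypothetical helper: run "left | right", store up to cap-1 bytes of
 * right's stdout in out (NUL-terminated), and return right's exit status,
 * or -1 on error. */
int run_pipeline(char *const left[], char *const right[], char *out, size_t cap)
{
    int mid[2], res[2], e1, e2, status = 0, rstatus = -1;
    pid_t lp, rp;
    posix_spawn_file_actions_t la, ra;
    size_t len = 0;
    ssize_t n;

    if (pipe(mid) != 0)
        return -1;
    if (pipe(res) != 0) {
        close(mid[0]); close(mid[1]);
        return -1;
    }

    /* left: stdout -> mid. right: stdin <- mid, stdout -> res. */
    posix_spawn_file_actions_init(&la);
    posix_spawn_file_actions_adddup2(&la, mid[1], 1);
    posix_spawn_file_actions_init(&ra);
    posix_spawn_file_actions_adddup2(&ra, mid[0], 0);
    posix_spawn_file_actions_adddup2(&ra, res[1], 1);
    /* Both children close the original pipe fds after the dup2 steps. */
    for (int i = 0; i < 2; i++) {
        posix_spawn_file_actions_addclose(&la, mid[i]);
        posix_spawn_file_actions_addclose(&la, res[i]);
        posix_spawn_file_actions_addclose(&ra, mid[i]);
        posix_spawn_file_actions_addclose(&ra, res[i]);
    }

    e1 = posix_spawnp(&lp, left[0], &la, NULL, left, environ);
    e2 = posix_spawnp(&rp, right[0], &ra, NULL, right, environ);
    posix_spawn_file_actions_destroy(&la);
    posix_spawn_file_actions_destroy(&ra);

    /* The parent keeps only the read end of the result pipe. */
    close(mid[0]); close(mid[1]); close(res[1]);
    while (len + 1 < cap && (n = read(res[0], out + len, cap - 1 - len)) > 0)
        len += (size_t)n;
    out[len] = '\0';
    close(res[0]);

    if (e1 == 0)
        waitpid(lp, &status, 0);
    if (e2 == 0 && waitpid(rp, &status, 0) == rp && WIFEXITED(status))
        rstatus = WEXITSTATUS(status);
    return (e1 == 0 && e2 == 0) ? rstatus : -1;
}
```

The addclose actions matter: if either child kept a write end of a pipe open, the reader downstream would never see EOF.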
Oh cool, I didn’t know that .. Are there any docs or books with some history / background on posix_spawn()?
When I google I get man pages, but they sort of tell “what” and not “why”
And random tweets, the poorest medium for technical information :-( https://twitter.com/ridiculous_fish/status/1232889391531491329?lang=en
I guess this kind of thing is what I’m looking for
The “why” was largely political, IMO: The spawn() family of functions were introduced into POSIX so that Windows NT could get a POSIX certification and win government contracts, since it had spawn() and not fork().
Tanenbaum’s Modern Operating Systems was written around this time, and you might find its discussion of process-spawning APIs interesting: He doesn’t mention performance, and indeed Linux’s fork+exec was faster than NT’s CreateProcess, so I find it incredibly unlikely that NT’s omission of fork() was for performance; more likely it was to simplify other parts of the NT design.
The suggestion to run a subprocess that calls tcsetpgrp before exec isn’t a bad one, and maybe obviates some of the performance benefits you get from posix_spawn, but it might not be so bad because that subprocess can be a real simple tiny static binary that does what it needs to and calls exec(). One day maybe we won’t have to worry about this.
Another option is to just wait+WIFSTOPPED and then kill+SIGCONT it if it’s supposed to be in the foreground.
Very strange claim. AFAIK it was possible to implement fork on top of Windows NT native API right from the beginning. (I can try to find links.) And early Windows POSIX subsystems (including Interix) actually implemented fork. (This was long before WSL happened.) And Interix actually directly implemented fork on top of Windows NT native API, as opposed to very hacky Cygwin’s fork implementation.
Also IIRC the very first Windows POSIX subsystem happened before posix_spawn was added to POSIX. (Windows had a lot of different official POSIX subsystems authored by Microsoft, WSL is the last one.)
I think you’re thinking of ZwCreateSection, but I don’t think this was a win32 API call (or even well-documented), and it takes a careful reading to see how fork could be implemented with it, so I don’t think this is the same as having fork() – after all, there’s got to be lots of ways to get a fork-like API out of other things, including:
I remember they claimed they had reasons for not using ZwCreateSection, but I don’t know enough to know what problems they ran into.
It’s not quite that clear cut. Modern versions of NT have a thing called a picoprocess, which was originally added for Drawbridge but is now used for WSL. These are basically processes that start (almost) empty. Creating a new process with CreateProcess[Ex] creates a new process very quickly, but then maps a huge amount of stuff into it. This is the equivalent of execve + ld-linux.so running and that’s what takes almost all of the time.
Even on Linux, vfork instead of fork is faster (especially on pre-Milan x86, where fork needs to IPI other cores for TLB synchronisation).
XNU has a lot of extensions to POSIX spawn that make it almost usable. Unfortunately, they’re not implemented anywhere else. The biggest problem with the API is that they constrained it to permit userspace implementations. As such, it is strictly less expressive than vfork + execve. That said, vfork isn’t actually that bad an API. It would be even better if the execve simply returned and vfork didn’t return twice. Then the sequence would be simply vfork, setup, execve, cleanup.
With a language like C++ that supports RAII, you can avoid the footguns of vfork by doing
This ensures that anything that you created in between the setup is cleaned up. I generally use a std::vector for the execve arguments. This must be declared in the enclosing scope, so it’s cleaned up in the parent. It’s pretty easy to wrap this in a function that takes a lambda and passes it a reference to the argv and envp vectors, and executes it before the execve. This ensures that you get the memory management right. As a caller, you just pass a lambda that does any file descriptor opening and so on. The wrapper that I use also takes a vector of file descriptors to inherit, so you can open some files before spawning the child, then do arbitrary additional setup in the child context (including things like entering capability mode or attaching to a jail).
So, a few very common cases that would break:
Redirecting stdin/stdout/stderr. How do you preserve the parent’s stdin/stdout/stderr for the cleanup step without also passing it to the child?
Changing UID/GID. Whoops, the parent is now no longer root, and can’t change back.
Entering a jail/namespace. Again, the parent is now in that jail, so it can’t break out without also leaving the child with an escape hatch.
Basically anything that locks down the child in some way, will also affect the parent now.
I don’t understand how any of those use cases break. In between vfork and execve, you are running in the child’s context, just as you are now. You can drop privileges, open / close / dup file descriptors, and so on. The only difference is that you wouldn’t have the behaviour where execve effectively longjmps back to the vfork, you’d just return to running in the parent’s context after you started the child.
Ok, so I think I see what you mean now.
At the point of vfork, the kernel creates two processes using the same page tables. One is suspended (parent), and the other isn’t (child).
The child continues running until execve. At that point the child process image is replaced with the executable loaded by exec. The parent process is then resumed but with the register/stack state of the child.
That would actually work pretty nicely.
Exactly. The vfork call just switches out the kernel data structures associated with the running thread but leaves everything else in place, the execve would switch back and doesn’t do any of the saving and restoring of register state that makes vfork a bit exciting. The only major change would be that execve would return to the parent process’ kernel state even in case of failure.
That’s pretty elegant. If I’m ever arsed writing a hobby OS-kernel again, I’m definitely going to try implementing this.
Actually that tweet thread by the author of fish is very good.
This is exactly what Melvin just added, so it looks like we can’t use it? On what platforms?
https://github.com/ksh93/ksh/blob/dev/src/lib/libast/comp/spawnveg.c is a good one to look at; you can see how to use posix_spawn_file_actions_addtcsetpgrp_np and if/when POSIX_SPAWN_TCSETPGROUP shows up you can see how it could be added.
Hm in the Twitter thread by ridiculous_fish he points to this thread:
https://www.mail-archive.com/[email protected]/msg00718.html
The way I’m reading this is that ksh has a bug due to using posix_spawn() and lack of tcsetpgrp(). And they are suggesting that the program being CALLED by the shell can apply a workaround, not the shell itself!
This seems very undesirable.
I think we could use posix_spawn() when job control is off, but not when it’s on.
And I actually wonder if this is the best way to port to Windows? Not sure
posix_spawn is documented in POSIX: https://pubs.opengroup.org/onlinepubs/9699919799/ , especially here: https://pubs.opengroup.org/onlinepubs/9699919799/functions/posix_spawn.html . This manpage contains a big “RATIONALE” section, and a “SEE ALSO” section with the posix_spawnattr_* functions.
Glibc and musl add more and more optimizations over time, allowing posix_spawn to use vfork (as opposed to fork) in more and more cases. It is quite possible that recent versions of glibc and musl call vfork in all cases.
AFAIK this glibc bug report https://sourceware.org/bugzilla/show_bug.cgi?id=10354 is resolved using patchset, which makes glibc always use vfork/CLONE_VFORK.
Using vfork is not simple. This article explains how hard it is: https://ewontfix.com/7/
I can post code of my unshare(1) analog. It seems I can simply add CLONE_VFORK option to list of clone options and everything will work
That would be cool! What do you use it for?
I would like to do some container-like stuff without Docker, e.g. I’ve looked at bubblewrap, crun, runc, etc. a little bit
We have a bunch of containers and I want to gradually migrate away from them
I also wonder if on Linux at least a shell should distribute a bubblewrap-like tool ! Although I think security takes a long time to get right on Linux, so we probably don’t want that responsibility
Here is my util, I call it “asjail” for “Askar Safin’s jail”:
https://paste.gg/p/anonymous/4d26975181eb4223b10800911255c951
It is public domain. I’m sorry for Russian comments.
Compile with gcc -o asjail asjail.c. My program is x86 Linux specific. Run like so: asjail -p bash.
Program has options -imnpuU, which correspond to unshare(1) options (see its manpage). (Also, actual unshare has option -r, I have no such cool option.)
My program usually requires root privileges, but you can specify -U flag, which creates user namespace. So, you can run asjail -pU bash as normal user, this will create new user namespace and then create PID namespace inside it. (Again: unshare -pr bash is even better.)
But user namespace requires that they should be enabled in kernel. In some distros they are enabled by default, in others - not.
I wrote this util nearly 10 years ago. Back then I wanted some lightweight container solution. I was not aware of unshare(1) util. unshare(1) fully subsumes my util. (Don’t confuse with unshare(2) syscall.) Also, unshare(1) is low-level util, it’s lower than bubblewrap, runc, etc.
I don’t remember some details in this code, for example I don’t remember why I need this signal mask manipulations.
Today I use docker and I’m happy with it. Not only does docker provide isolation, it also allows you to write Dockerfiles. Partial results in Dockerfiles are cached, so you can edit a line in a Dockerfile, and docker will rebuild exactly what is needed and no more. Dockerfiles are perfect for bug reports, i.e. you can simply send a Dockerfile instead of a “steps to reproduce” section. The only problem with docker is the inability to run systemd inside it. I have read that this is solved by podman, but I didn’t test it. Also, Dockerfiles are not quite reproducible, because they often rely on downloading something from the internet. I have read that the proper solution is Nix, but I didn’t test it.
Additional comments on asjail (and other topics).
You need to add -- to make sure options are processed as part of target command, not asjail itself, i. e. asjail -p -- bash -c 'echo ok'.
asjail was written by careful reading of clone(2) manual page.
asjail is not complete solution. It doesn’t do chrooting, mounting needed directories (/proc, /dev, etc). So back then 10 years ago I wrote bash script called asjail-max-1, which does these additional steps. asjail-max-1 was written by careful reading of http://www.freedesktop.org/wiki/Software/systemd/ContainerInterface and so asjail-max-1 (together with asjail) can run systemd in container! asjail-max-1 is screen sized bash script.
But, unfortunately, to make all this work, you also need C program, which emulates terminal using forkpty(3). So I wrote such a program and I called it pty. It is some screens sized C program. Together asjail, asjail-max-1 and pty gives you complete small solution for running containers. In something like 5 screens of code.
I can post all this code.
But all this is not needed, because all this is subsumed by existing tools. asjail is subsumed by unshare. And asjail-max-1+asjail+pty is subsumed by systemd-nspawn.
Today I don’t use any of these my tools. When I need to run container I use docker. If I need to run existing file tree I use systemd-nspawn
Also, all these tools I discussed so far are designed for maximal isolation. I. e. to prevent container from accessing host resources. But sometimes you have opposite task, i. e. you want to run some program and give it access to host resources, for example to X server connection, to sound card etc. So I have another script, which is simple wrapper for chroot, which does exactly this: https://paste.gg/p/anonymous/84c3685e200347299cac0dbd23d31bf3
Also a good paper about old POSIX APIs not being a great fit for modern applications (Android and OS X both use non-POSIX IPC throughout, etc.)
POSIX Abstractions in Modern Operating Systems: The Old, the New, and the Missing (2016)
https://roxanageambasu.github.io/publications/eurosys2016posix.pdf
https://news.ycombinator.com/item?id=11652609 (51 comments)
https://lobste.rs/s/jhyzvh/posix_has_become_outdated_2016 (good summary comment)
https://lobste.rs/s/vav0xl/posix_abstractions_modern_oses_old_new (1 comment 6 years ago)
This is not true. Ninja redirects the output of each command so that the outputs of multiple parallel commands don’t mix together.
I want to write my own shell some day. And I don’t want to use fork there, I will use posix_spawn instead. But this will be tricky. How to launch a subshell? Using /proc/self/exe? What if /proc is not mounted? I will possibly try to send a patch to the Linux kernel to make /proc/self/exe available even if /proc is not mounted. Busybox docs contain (or contained in the past) such a patch, so theoretically I can simply copy it from there. I can try to find a link.
BTW my reading of this thread above https://lobste.rs/s/smbsd5/fork_road#c_qlextq is that if you want a shell to have job control (and POSIX requires job control), you should probably use fork() when it’s on.
Also I’m interested in whether posix_spawn() is the best way to port a shell to Windows or not …. not sure what APIs bash uses on Windows.
Is posix_spawn() built on top of Win32? How do they do descriptors, etc. ?
Surprisingly, we can combine posix_spawn speed with the needs of docker and systemd-nspawn. Let me tell you how.
First of all, let’s notice that fork alone is not enough for systemd-nspawn. You also need to put the child into new mount/UTS/etc. namespaces. For this you need clone or unshare(2).
Fortunately, clone has the flag CLONE_VFORK, which allows us to get vfork-like behavior, i.e. our program will be faster than with fork.
So, to summarize, we have one option (originally “two options”, with “either one will be enough”) to combine posix_spawn speed with the features systemd-nspawn needs:
- clone with all the namespacing flags we need (such as CLONE_NEWNS) and CLONE_VFORK
- Create a new process using the usual posix_spawn and then call unshare to put the process into a new namespace
I haven’t tested any of this, so it’s possible something will go wrong.
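A minimal sketch of the clone variant in C, under a few assumptions: the function names are made up for illustration, glibc's clone() wrapper is used, and the caller passes whatever namespace flags it wants. Creating mount or UTS namespaces normally needs CAP_SYS_ADMIN (or CLONE_NEWUSER in the same call), so the flags are a parameter here. Note that CLONE_VFORK on its own only suspends the parent until the child execs or exits; vfork(2) additionally shares memory (CLONE_VM), which this sketch leaves out, so whether it is actually faster than fork is untested here too.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Child entry point: replace the process image, or exit 127 if exec fails. */
static int exec_child(void *arg) {
    char *const *argv = arg;
    execvp(argv[0], argv);
    _exit(127);
}

/* Hypothetical helper: run argv in a child created by clone() with
 * CLONE_VFORK plus the given namespace flags (e.g. CLONE_NEWNS).
 * Returns the child's exit status, or -1 on error. */
int spawn_in_namespaces(char *const argv[], int ns_flags) {
    const size_t stack_size = 64 * 1024;
    char *stack = malloc(stack_size);
    if (stack == NULL)
        return -1;

    /* The stack grows down on common architectures, so pass its top. */
    pid_t pid = clone(exec_child, stack + stack_size,
                      CLONE_VFORK | ns_flags | SIGCHLD, (void *)argv);
    if (pid < 0) {
        free(stack);
        return -1;
    }

    int status = 0;
    pid_t reaped = waitpid(pid, &status, 0);
    free(stack);
    if (reaped != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

With enough privileges, a call like `spawn_in_namespaces(argv, CLONE_NEWNS | CLONE_NEWUTS)` would start the program in fresh mount and UTS namespaces; with `0` it behaves like a plain spawn.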
Also: I’m the author of my own unshare(1) util analog (don’t confuse it with the unshare(2) syscall). My util doesn’t do any speed-up tricks, I just use plain clone without CLONE_VFORK.
Also, I wrote a Rust program, which spawns a program using posix_spawn, redirects its in, out and err using posix_spawn_file_actions_adddup2, collects its out and err, waits for it to finish and reports the status.
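For readers who want to see the shape of such a program, here is a rough C sketch (not the Rust program mentioned above). It only captures stdout, while the original also handles stdin and stderr; the function name and the fixed-size buffer are assumptions made for the example.

```c
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Hypothetical helper: run argv via posix_spawnp with stdout redirected
 * into a pipe, copy up to cap-1 bytes of output into out, and return the
 * exit status (or -1 on error). If the output does not fit, the pipe is
 * closed early and the child may die from SIGPIPE, which yields -1. */
int spawn_capture_stdout(char *const argv[], char *out, size_t cap) {
    int fds[2];
    if (cap == 0 || pipe(fds) != 0)
        return -1;

    /* Actions performed in the child between "fork" and "exec". */
    posix_spawn_file_actions_t fa;
    posix_spawn_file_actions_init(&fa);
    posix_spawn_file_actions_adddup2(&fa, fds[1], STDOUT_FILENO);
    posix_spawn_file_actions_addclose(&fa, fds[0]);
    posix_spawn_file_actions_addclose(&fa, fds[1]);

    pid_t pid;
    int err = posix_spawnp(&pid, argv[0], &fa, NULL, argv, environ);
    posix_spawn_file_actions_destroy(&fa);
    close(fds[1]); /* the parent keeps only the read end */
    if (err != 0) {
        close(fds[0]);
        return -1;
    }

    /* Collect the child's output until EOF or until the buffer is full. */
    size_t len = 0;
    ssize_t n;
    while (len + 1 < cap &&
           (n = read(fds[0], out + len, cap - 1 - len)) > 0)
        len += (size_t)n;
    out[len] = '\0';
    close(fds[0]);

    int status = 0;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Redirecting stdin and stderr would follow the same pattern, with one more pipe and one more adddup2 call per descriptor.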
I was reminded of this by the discussion about problems with unix process APIs. In general I like the idea of a capability-oriented process API, which is basically what pidfd is. Then if process-manipulating syscalls require a process capability, operating on other processes is just like operating on your own, which also solves the issue of awkwardly huge spawn APIs, as in section 6 “Replacing fork - Low-level: Cross-process operations.”
I think there’s a contradiction in this paper, tho: I think in principle if it’s possible to explicitly donate a share of parts of the current process’s state to a new empty process, then it’s possible to implement fork in userland, which is basically what that subsection of section 6 says. But that contradicts section 5 “Implementing fork - Fork infects an entire system” which claims “an abstraction, fork fails to compose: unless every layer supports fork, it cannot be used.”
I don’t think there’s a contradiction. A userland fork would also not compose with other things, but having a feature in one library that doesn’t compose with features in other libraries is not a new problem: it’s inherent to the UNIX shared library model (and a regression from MULTICS).
Modern versions of Linux’s clone syscall accept the CLONE_PIDFD flag, which lets you get a pidfd immediately!
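A small sketch of what that looks like in C, assuming Linux 5.2 or later and glibc's clone() wrapper (with legacy clone, the pidfd is stored through the parent_tid argument). The function names are invented for the example, and the fallback #define is only there for older libc headers.

```c
#define _GNU_SOURCE
#include <poll.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef CLONE_PIDFD
#define CLONE_PIDFD 0x00001000 /* value from the Linux UAPI headers */
#endif

/* Child entry point: exit with the status code passed through arg. */
static int exit_with_code(void *arg) {
    _exit(*(int *)arg);
}

/* Hypothetical helper: create a child with clone(CLONE_PIDFD), wait for it
 * to exit by polling the pidfd, and return its exit status (-1 on error). */
int run_child_with_pidfd(int code) {
    const size_t stack_size = 64 * 1024;
    char *stack = malloc(stack_size);
    if (stack == NULL)
        return -1;

    int pidfd = -1;
    pid_t pid = clone(exit_with_code, stack + stack_size,
                      CLONE_PIDFD | SIGCHLD, &code, &pidfd);
    if (pid < 0) {
        free(stack);
        return -1;
    }

    /* A pidfd becomes readable when the process it refers to terminates. */
    struct pollfd pfd = { .fd = pidfd, .events = POLLIN };
    int status = 0;
    int ok = poll(&pfd, 1, -1) == 1 && waitpid(pid, &status, 0) == pid;

    close(pidfd);
    free(stack);
    return (ok && WIFEXITED(status)) ? WEXITSTATUS(status) : -1;
}
```

Because the pidfd is an ordinary file descriptor, it can sit in the same poll/epoll loop as sockets and pipes, which is the capability-style handling discussed above.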
This problem is solved using io_uring_spawn (which is AFAIK not mainline yet). See https://lwn.net/Articles/908268/ and my comment below: https://lobste.rs/s/smbsd5/fork_road#c_vf4hnp
I’m excited to see if io_uring_spawn gets momentum. (https://lwn.net/Articles/908268/)
io_uring_spawn is a very good thing!!
If you like this article (“fork in the road”), you will like io_uring_spawn, too. I think io_uring_spawn solves all fork problems, while being faster than all other solutions, even faster than vfork and posix_spawn. The slides above include a table with a time comparison.
Also, “fork in the road” includes this statement:
io_uring_spawn is exactly this design!!!
Zillions of C functions are incompatible with fork, including… printf!!!!! Consider this code:
This code prints “Hello” twice on my machine (Linux x86_64), because the userspace buffer is duplicated.
Yup, shells have to be careful to flush() at specific points to avoid this … I definitely hit those bugs in both Python and C++
The article mentions (IIRC) that fork interacts badly with sysctl vm.overcommit_memory=2 on Linux. This is very bad, because sysctl vm.overcommit_memory=2 is what you (from a perfectionist view) should always do (unfortunately, this breaks some software in the real world). You can read about overcommit here: https://ewontfix.com/3/
This paper reads like sour grapes. I think fork() is great, but maybe some other metaphors would be great as well.
Once upon a time, I was working on a DNS server, and after finding it a bit slow, did a:
after binding the sockets and immediately got it running multicore. Yes, I could have rewritten things to allow main() to accept the socket to share, or to have a server process pass out threads, and then I could have used spawn, but I have a hard time being mad at those 14 bytes.
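The pattern described here (bind first, fork afterwards, so every process shares the same socket) can be sketched in C as follows. Everything specific is an assumption: the two fork() calls are a guess that happens to be 14 bytes long, the function names are invented, and the whole “server” runs in a child process only so that the caller can act as a test client on loopback UDP.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Worker body: answer a single datagram with this process's pid. */
static void answer_one(int sock) {
    char buf[16];
    struct sockaddr_in peer;
    socklen_t len = sizeof peer;
    if (recvfrom(sock, buf, sizeof buf, 0, (struct sockaddr *)&peer, &len) >= 0) {
        pid_t me = getpid();
        sendto(sock, &me, sizeof me, 0, (struct sockaddr *)&peer, len);
    }
}

/* Binds one UDP socket, forks twice after the bind, and returns how many
 * distinct processes answered four probes (or -1 on setup errors). */
int count_workers_after_double_fork(void) {
    struct timeval tmo = { .tv_sec = 5, .tv_usec = 0 };
    struct sockaddr_in addr = { .sin_family = AF_INET };
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK); /* port 0: kernel picks one */
    socklen_t alen = sizeof addr;

    int srv = socket(AF_INET, SOCK_DGRAM, 0);
    int cli = socket(AF_INET, SOCK_DGRAM, 0);
    if (srv < 0 || cli < 0)
        return -1;
    /* Timeouts so neither side blocks forever if a datagram goes missing. */
    setsockopt(srv, SOL_SOCKET, SO_RCVTIMEO, &tmo, sizeof tmo);
    setsockopt(cli, SOL_SOCKET, SO_RCVTIMEO, &tmo, sizeof tmo);
    if (bind(srv, (struct sockaddr *)&addr, sizeof addr) != 0)
        return -1;
    if (getsockname(srv, (struct sockaddr *)&addr, &alen) != 0)
        return -1;

    pid_t server = fork();
    if (server < 0)
        return -1;
    if (server == 0) {
        fork(); fork();   /* the assumed 14 bytes: one process becomes four */
        answer_one(srv);  /* all four read from the same bound socket */
        _exit(0);
    }
    close(srv);

    for (int i = 0; i < 4; i++)
        sendto(cli, "?", 1, 0, (struct sockaddr *)&addr, sizeof addr);

    pid_t seen[4];
    int distinct = 0;
    for (int i = 0; i < 4; i++) {
        pid_t who;
        if (recv(cli, &who, sizeof who, 0) != (ssize_t)sizeof who)
            break;
        int dup = 0;
        for (int j = 0; j < distinct; j++)
            dup |= (seen[j] == who);
        if (!dup)
            seen[distinct++] = who;
    }
    close(cli);
    waitpid(server, NULL, 0);
    return distinct;
}
```

The kernel hands each queued datagram to whichever process is blocked in recvfrom on the shared socket, which is what makes the server multicore without any other code changes.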
Curious that no one has yet mentioned that the Fuchsia developers were apparently fans of this position, as that OS lacks fork/exec: https://fuchsia.dev/fuchsia-src/concepts/kernel/libc#fork_and_exec
fork() is the only perfect syscall, it’s everything else that sucks.