A fork() in the road

17

A fork() in the road pdf osdev microsoft.com
via fanf 1 year ago | caches
| 41 comments

1. 3
  andyc edited 1 year ago | link
  Yes this is a great paper. It’s a perspective from kernel implementers, and the summary is that fork() is only good for shells these days, not other apps like servers :)
  
  It reminds me that I bookmarked a comment from the creator of Ninja about using posix_spawn() on OS X since 2016 for performance:
  
  https://news.ycombinator.com/item?id=30502392
  
  I think basically Ninja can do it because it only spawns simple processes like the compiler.
  
  It doesn’t do anything between fork() and exec(), which is how a shell sets up pipelines, redirects, and does job control (just implemented by Melvin in Oil, with setpgid() etc. )
  
  Also, anything like OCI / Docker / systemd-nspawn does a bunch of stuff between fork() and exec()
  
  It would be cool if someone writes a demo of that functionality without fork(), makes a blog post, etc.
  
  Related from the Hacker News thread:
  
  https://github.com/famzah/popen-noshell
  
  https://github.com/NobodyXu/aspawn/
  
  As a way to help alternative kernel implementers, I’d be open to making https://www.oilshell.org optionally use some different APIs, but I’m not sure of the details right now
  1. 3
    geocar 1 year ago | link
    
    It doesn’t do anything between fork() and exec(), which is how a shell sets up pipelines, redirects, and does job control (just implemented by Melvin in Oil, with setpgid() etc. )
    
    Most of those things can be done with posix_spawn_file_actions_t.
    
    posix_spawn_file_actions_t a, b; int p[6]; pid_t c[2];
    pipe(p), pipe(p+2), pipe(p+4);
    posix_spawn_file_actions_init(&a); posix_spawn_file_actions_init(&b);
    posix_spawn_file_actions_adddup2(&a, *p, 0);
    posix_spawn_file_actions_adddup2(&a, p[3], 1);
    posix_spawn_file_actions_adddup2(&b, p[2], 0);
    posix_spawn_file_actions_adddup2(&b, p[5], 1);
    posix_spawn(c, "a", &a, NULL, ...);
    posix_spawn(c+1, "b", &b, NULL, ...);
    // write to p[1] to feed the head of the pipeline (a); read the end of the pipeline (b) from p[4]
    // wait for *c (a) and c[1] (b) for the status codes.
    
    You can get setpgid with POSIX_SPAWN_SETPGROUP, and the signal mask with posix_spawnattr_setsigmask.
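A fuller version of the idea above, as a minimal sketch rather than anything posted in the thread: `run_pipeline` is a hypothetical helper that spawns `left | right` with posix_spawnp(), wires the pipes with file actions, puts both children in one process group via POSIX_SPAWN_SETPGROUP, and captures the output of the last stage. It assumes file descriptors 0 to 2 are already open and that both commands can be found on PATH.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Hypothetical helper (not from the thread): run `left | right` and copy
 * right's stdout into `out`. Returns right's exit status, or -1 on error.
 * Assumes fds 0-2 are already open, so every pipe fd is >= 3. */
int run_pipeline(char *const left[], char *const right[], char *out, size_t outsz)
{
    int a[2], b[2];                  /* a: left -> right, b: right -> parent */
    pid_t pids[2] = { -1, -1 };
    posix_spawn_file_actions_t fl, fr;
    posix_spawnattr_t attr;
    int status = -1, result = -1;
    size_t used = 0;
    ssize_t n;

    assert(outsz > 0);
    if (pipe(a) != 0) return -1;
    if (pipe(b) != 0) { close(a[0]); close(a[1]); return -1; }

    posix_spawn_file_actions_init(&fl);
    posix_spawn_file_actions_init(&fr);
    posix_spawn_file_actions_adddup2(&fl, a[1], STDOUT_FILENO);
    posix_spawn_file_actions_adddup2(&fr, a[0], STDIN_FILENO);
    posix_spawn_file_actions_adddup2(&fr, b[1], STDOUT_FILENO);
    /* Close the original pipe fds in both children so EOF can propagate. */
    int fds[4] = { a[0], a[1], b[0], b[1] };
    for (int i = 0; i < 4; i++) {
        posix_spawn_file_actions_addclose(&fl, fds[i]);
        posix_spawn_file_actions_addclose(&fr, fds[i]);
    }

    /* The setpgid() equivalent: left leads a new group, right joins it. */
    posix_spawnattr_init(&attr);
    posix_spawnattr_setflags(&attr, POSIX_SPAWN_SETPGROUP);
    posix_spawnattr_setpgroup(&attr, 0);
    if (posix_spawnp(&pids[0], left[0], &fl, &attr, left, environ) != 0)
        pids[0] = -1;
    if (pids[0] > 0) {
        posix_spawnattr_setpgroup(&attr, pids[0]);
        if (posix_spawnp(&pids[1], right[0], &fr, &attr, right, environ) != 0)
            pids[1] = -1;
    }
    posix_spawnattr_destroy(&attr);
    posix_spawn_file_actions_destroy(&fl);
    posix_spawn_file_actions_destroy(&fr);

    /* The parent keeps only the read end of b. */
    close(a[0]); close(a[1]); close(b[1]);
    if (pids[1] > 0) {
        while (used < outsz - 1 &&
               (n = read(b[0], out + used, outsz - 1 - used)) > 0)
            used += (size_t)n;
    }
    out[used] = '\0';
    close(b[0]);

    if (pids[0] > 0) waitpid(pids[0], NULL, 0);
    if (pids[1] > 0 && waitpid(pids[1], &status, 0) == pids[1] && WIFEXITED(status))
        result = WEXITSTATUS(status);
    return result;
}
```

Calling it with `{"echo", "hello", NULL}` and `{"tr", "a-z", "A-Z", NULL}` should leave the upper-cased text in `out`. Nothing here hands the terminal to the new process group, which is the tcsetpgrp() gap discussed further down the thread.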
    1. 1
      
      andyc edited 1 year ago | link
      
      Oh cool, I didn’t know that .. Are there any docs or books with some history / background on posix_spawn()?
      
      When I google I get man pages, but they sort of tell “what” and not “why”
      
      And random tweets, the poorest medium for technical information :-( https://twitter.com/ridiculous_fish/status/1232889391531491329?lang=en
      
      I guess this kind of thing is what I’m looking for
      
      4/ Linux does posix_spawn in user space, on top of fork - but sometimes vfork (an even faster, more dangerous fork). The vfork usage is what can make posix_spawn faster on Linux.
      1. 6
        
        geocar 1 year ago | link
        
        and not “why”
        
        The “why” was largely political, IMO: The spawn() family of functions were introduced into POSIX so that Windows NT could get a POSIX certification and win government contracts, since it had spawn() and not fork().
        
        Tanenbaum’s Modern Operating Systems was written around this time, and you might find its discussion of process-spawning APIs interesting: He doesn’t mention performance, and indeed Linux’s fork+exec was faster than NT’s CreateProcess so I find it incredibly unlikely NT’s omission of fork() was for performance, but more likely to simplify other parts of the NT design.
        
        I guess this kind of thing is what I’m looking for
        
        The suggestion to run a subprocess that calls tcsetpgrp before exec isn’t a bad one, and maybe obviates some of the performance benefits you get from posix_spawn, but it might not be so bad because that subprocess can be a real simple tiny static binary that does what it needs to and calls exec(). One day maybe we won’t have to worry about this.
        
        Another option is to just wait+WIFSTOPPED and then kill+SIGCONT it if it’s supposed to be in the foreground.
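The wait+WIFSTOPPED / kill+SIGCONT option could look roughly like the sketch below. `wait_resuming` is a hypothetical name, and the cap of eight resumes is an arbitrary choice to keep the loop from spinning. The sketch resumes the child on any stop signal, whereas a real shell would have to tell SIGTTIN/SIGTTOU apart from a user pressing Ctrl-Z. Passing `tty_fd = -1` skips the terminal hand-off.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: wait for `pid`; if it stops, optionally make its
 * process group the foreground group of `tty_fd` and continue it.
 * Returns the exit status, 128+signal if killed, or -1 on error.
 * A caller that is itself in the background must ignore SIGTTOU before
 * calling tcsetpgrp(). */
int wait_resuming(pid_t pid, int tty_fd)
{
    int status;
    int resumes_left = 8;            /* arbitrary cap for this sketch */

    assert(pid > 0);
    for (;;) {
        if (waitpid(pid, &status, WUNTRACED) < 0)
            return -1;
        if (WIFEXITED(status))
            return WEXITSTATUS(status);
        if (WIFSIGNALED(status))
            return 128 + WTERMSIG(status);
        if (WIFSTOPPED(status)) {
            if (resumes_left-- == 0)
                return -1;
            if (tty_fd >= 0)
                tcsetpgrp(tty_fd, getpgid(pid));
            if (kill(pid, SIGCONT) != 0)
                return -1;
        }
    }
}
```

The cost of this approach is that the child does stop briefly before the parent notices, which the "correct" fork-based race avoids.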
        
        2
        
        safinaskar 1 year ago | link
        
        since it had spawn() and not fork()
        
        Very strange claim. AFAIK it was possible to implement fork on top of Windows NT native API right from the beginning. (I can try to find links.) And early Windows POSIX subsystems (including Interix) actually implemented fork. (This was long before WSL happened.) And Interix actually directly implemented fork on top of Windows NT native API, as opposed to very hacky Cygwin’s fork implementation.
        
        Also IIRC the very first Windows POSIX subsystem happened before posix_spawn was added to POSIX. (Windows had a lot of different official POSIX subsystems authored by Microsoft, WSL is the last one.)
        
        1
        
        geocar 1 year ago | link
        
        AFAIK it was possible to implement fork on top of Windows NT native API right from the beginning.
        
        I think you’re thinking of ZwCreateSection, but I don’t think this was a Win32 API call (or even well-documented), and it takes a careful reading to see how fork could be implemented with it, so I don’t think this is the same as having fork() – after all, there’s got to be lots of ways to get a fork-like API out of other things, including:
        
        as opposed to very hacky Cygwin’s fork implementation.
        
        I remember they claimed they had reasons for not using ZwCreateSection, but I don’t know enough to say what problems they ran into.
        
        1
        
        david_chisnall 1 year ago | link
        
        He doesn’t mention performance, and indeed Linux’s fork+exec was faster than NT’s CreateProcess so I find it incredibly unlikely NT’s omission of fork() was for performance, but more likely to simplify other parts of the NT design.
        
        It’s not quite that clear cut. Modern versions of NT have a thing called a picoprocess, which was originally added for Drawbridge but is now used for WSL. These are basically processes that start (almost) empty. Creating a new process with CreateProcess[Ex] creates a new process very quickly, but then maps a huge amount of stuff into it. This is the equivalent of execve + ld-linux.so running and that’s what takes almost all of the time.
        
        Even on Linux, vfork instead of fork is faster (especially on pre-Milan x86, where fork needs to IPI other cores for TLB synchronisation).
      2. 3
        
        david_chisnall 1 year ago | link
        
        XNU has a lot of extensions to POSIX spawn that make it almost usable. Unfortunately, they’re not implemented anywhere else. The biggest problem with the API is that they constrained it to permit userspace implementations. As such, it is strictly less expressive than vfork + execve. That said, vfork isn’t actually that bad an API. It would be even better if the execve simply returned and vfork didn’t return twice. Then the sequence would be simply vfork, setup, execve, cleanup.
        
        With a language like C++ that supports RAII, you can avoid the footguns of vfork by doing
        
        pid_t pid = vfork();
        if (pid == 0)
        {
            {
                // Set up the child
            }
            execve(…);
            pid = -1;
        }
        
        This ensures that anything you created during the setup is cleaned up. I generally use a std::vector for the execve arguments. This must be declared in the enclosing scope, so that it’s cleaned up in the parent. It’s pretty easy to wrap this in a function that takes a lambda, passes it a reference to the argv and envp vectors, and executes it before the execve. This ensures that you get the memory management right. As a caller, you just pass a lambda that does any file descriptor opening and so on. The wrapper that I use also takes a vector of file descriptors to inherit, so you can open some files before spawning the child, then do arbitrary additional setup in the child context (including things like entering capability mode or attaching to a jail).
        
        1
        
        joed edited 1 year ago | link
        
        It would be even better if the execve simply returned and vfork didn’t return twice. Then the sequence would be simply vfork, setup, execve, cleanup.
        
        So, a few very common cases that would break:
        
        Redirecting stdin/stdout/stderr. How do you preserve the parent’s stdin/stdout/stderr for the cleanup step without also passing it to the child?
        
        Changing UID/GID. Whoops, the parent is now no longer root, and can’t change back.
        
        Entering a jail/namespace. Again, the parent is now in that jail, so it can’t break out without also leaving the child with an escape hatch.
        
        Basically anything that locks down the child in some way, will also affect the parent now.
        
        2
        
        david_chisnall 1 year ago | link
        
        I don’t understand how any of those use cases break. In between vfork and execve, you are running in the child’s context, just as you are now. You can drop privileges, open / close / dup file descriptors, and so on. The only difference is that you wouldn’t have the behaviour where execve effectively longjmps back to the vfork, you’d just return to running in the parent’s context after you started the child.
        
        1
        
        joed 1 year ago | link
        
        Ok, so I think I see what you mean now.
        
        At the point of vfork, the kernel creates two processes using the same page tables. One is suspended (parent), and the other isn’t (child).
        
        The child continues running until execve. At that point the child process image is replaced with the executable loaded by exec. The parent process is then resumed but with the register/stack state of the child.
        
        That would actually work pretty nicely.
        
        1
        
        david_chisnall 1 year ago | link
        
        Exactly. The vfork call just switches out the kernel data structures associated with the running thread but leaves everything else in place, the execve would switch back and doesn’t do any of the saving and restoring of register state that makes vfork a bit exciting. The only major change would be that execve would return to the parent process’ kernel state even in case of failure.
        
        1
        
        joed 1 year ago | link
        
        That’s pretty elegant. If I’m ever arsed writing a hobby OS-kernel again, I’m definitely going to try implementing this.
      3. 2
        
        andyc 1 year ago | link
        
        Actually that tweet thread by the author of fish is very good.
        
        This is exactly what Melvin just added, so looks like we can’t use it ? On what platforms?
        
        8/ What’s missing? One hole is the syscall you’ve never heard of: tcsetpgrp(), which hands-off tty ownership. The “correct” usage with fork is a benign race, where both the parent (tty donator) and child (tty inheritor) request that the child own the tty.
        
        9/ There is no “correct” tcsetpgrp usage with posix_spawn: no way to coax the child into claiming the tty. This means, when job control is on, you may start a process which immediately stops (SIGTTIN) or otherwise. Here’s ksh getting busted: https://mail-archive.com/ast-developers
        
        1
        
        geocar 1 year ago | link
        
        so looks like we can’t use it ?
        
        https://github.com/ksh93/ksh/blob/dev/src/lib/libast/comp/spawnveg.c is a good one to look at; you can see how to use posix_spawn_file_actions_addtcsetpgrp_np and if/when POSIX_SPAWN_TCSETPGROUP shows up you can see how it could be added.
        
        1
        
        andyc 1 year ago | link
        
        Hm in the Twitter thread by ridiculous_fish he points to this thread:
        
        https://www.mail-archive.com/[email protected]/msg00718.html
        
        The way I’m reading this is that ksh has a bug due to using posix_spawn() and lack of tcsetpgrp(). And they are suggesting that the program being CALLED by the shell can apply a workaround, not the shell itself!
        
        This seems very undesirable.
        
        I think we could use posix_spawn() when job control is off, but not when it’s on.
        
        And I actually wonder if this is the best way to port to Windows? Not sure
      4. 1
        
        safinaskar 1 year ago | link
        
        posix_spawn is documented in POSIX: https://pubs.opengroup.org/onlinepubs/9699919799/ , specifically here: https://pubs.opengroup.org/onlinepubs/9699919799/functions/posix_spawn.html . That page contains a big “RATIONALE” section, and a “SEE ALSO” section listing the posix_spawnattr_* functions.
      5. 1
        
        safinaskar 1 year ago | link
        
        Linux does posix_spawn in user space, on top of fork - but sometimes vfork
        
        Glibc and musl add more and more optimizations over time, allowing posix_spawn to use vfork (as opposed to fork) in more and more cases. It is quite possible that recent versions of glibc and musl call vfork in all cases.
        
        AFAIK this glibc bug report https://sourceware.org/bugzilla/show_bug.cgi?id=10354 was resolved by a patchset that makes glibc always use vfork/CLONE_VFORK.
        
        Using vfork is not simple. This article explains how hard it is: https://ewontfix.com/7/
  2. 2
    
    safinaskar 1 year ago | link
    
    I can post code of my unshare(1) analog. It seems I can simply add CLONE_VFORK option to list of clone options and everything will work
    1. 1
      
      andyc 1 year ago | link
      
      That would be cool! What do you use it for?
      
      I would like to do some container-like stuff without Docker, e.g. I’ve looked at bubblewrap, crun, runc, etc. a little bit
      
      We have a bunch of containers and I want to gradually migrate away from them
      
      I also wonder if on Linux at least a shell should distribute a bubblewrap-like tool ! Although I think security takes a long time to get right on Linux, so we probably don’t want that responsibility
      1. 1
        
        safinaskar 1 year ago | link
        
        Here is my util, I call it “asjail” for “Askar Safin’s jail”:
        
        https://paste.gg/p/anonymous/4d26975181eb4223b10800911255c951
        
        It is public domain. I’m sorry for Russian comments.
        
        Compile with gcc -o asjail asjail.c. My program is x86 Linux specific. Run like so: asjail -p bash.
        
        Program has options -imnpuU, which correspond to unshare(1) options (see its manpage). (Also, actual unshare has option -r, I have no such cool option.)
        
        My program usually requires root privileges, but you can specify -U flag, which creates user namespace. So, you can run asjail -pU bash as normal user, this will create new user namespace and then create PID namespace inside it. (Again: unshare -pr bash is even better.)
        
        But user namespaces have to be enabled in the kernel. Some distros enable them by default, others don’t.
        
        I wrote this util nearly 10 years ago. Back then I wanted some lightweight container solution and was not aware of the unshare(1) util, which fully subsumes mine. (Don’t confuse it with the unshare(2) syscall.) Also, unshare(1) is a low-level util; it’s lower-level than bubblewrap, runc, etc.
        
        I don’t remember some details of this code; for example, I don’t remember why I needed the signal mask manipulations.
        
        Today I use Docker and I’m happy with it. Not only does Docker provide isolation, it also lets you write Dockerfiles. Partial results of a Dockerfile are cached, so you can edit a line and Docker will rebuild exactly what is needed and no more. Dockerfiles are perfect for bug reports, i.e. you can simply send a Dockerfile instead of a “steps to reproduce” section. The only problem with Docker is the inability to run systemd inside it. I have read that this is solved by podman, but I haven’t tested it. Also, Dockerfiles are not quite reproducible, because they often rely on downloading something from the internet. I have read that the proper solution is Nix, but I haven’t tested it either.
      2. 1
        
        safinaskar 1 year ago | link
        
        Additional comments on asjail (and other topics).
        
        You need to add -- to make sure options are processed as part of target command, not asjail itself, i. e. asjail -p -- bash -c 'echo ok'.
        
        asjail was written by careful reading of clone(2) manual page.
        
        asjail is not a complete solution. It doesn’t chroot or mount the needed directories (/proc, /dev, etc). So, 10 years ago, I wrote a bash script called asjail-max-1, which does these additional steps. asjail-max-1 was written by careful reading of http://www.freedesktop.org/wiki/Software/systemd/ContainerInterface and so asjail-max-1 (together with asjail) can run systemd in a container! asjail-max-1 is a screen-sized bash script.
        
        But, unfortunately, to make all this work, you also need a C program that emulates a terminal using forkpty(3). So I wrote such a program and called it pty. It is a C program a few screens long. Together, asjail, asjail-max-1 and pty give you a complete small solution for running containers, in something like 5 screens of code.
        
        I can post all this code.
        
        But all this is not needed, because all this is subsumed by existing tools. asjail is subsumed by unshare. And asjail-max-1+asjail+pty is subsumed by systemd-nspawn.
        
        Today I don’t use any of these my tools. When I need to run container I use docker. If I need to run existing file tree I use systemd-nspawn
      3. 1
        
        safinaskar 1 year ago | link
        
        Also, all these tools I discussed so far are designed for maximal isolation. I. e. to prevent container from accessing host resources. But sometimes you have opposite task, i. e. you want to run some program and give it access to host resources, for example to X server connection, to sound card etc. So I have another script, which is simple wrapper for chroot, which does exactly this: https://paste.gg/p/anonymous/84c3685e200347299cac0dbd23d31bf3
  3. 1
    
    andyc edited 1 year ago | link
    
    Also a good paper about old POSIX APIs not being a great fit for modern applications (Android and OS X both use non-POSIX IPC throughout, etc.)
    
    POSIX Abstractions in Modern Operating Systems: The Old, the New, and the Missing (2016)
    
    https://roxanageambasu.github.io/publications/eurosys2016posix.pdf
    
    https://news.ycombinator.com/item?id=11652609 (51 comments)
    
    https://lobste.rs/s/jhyzvh/posix_has_become_outdated_2016 (good summary comment)
    
    https://lobste.rs/s/vav0xl/posix_abstractions_modern_oses_old_new (1 comment 6 years ago)
  4. 1
    
    safinaskar 1 year ago | link
    
    I think basically Ninja can do it because it only spawns simple processes like the compiler.
    
    It doesn’t do anything between fork() and exec(), which is how a shell sets up pipelines, redirects
    
    This is not true. Ninja redirects output of command so that outputs of multiple parallel commands don’t mix together
  5. 1
    
    safinaskar 1 year ago | link
    
    I want to write my own shell some day. And I don’t want to use fork there, I will use posix_spawn instead. But this will be tricky. How to launch subshell? Using /proc/self/exe? What if /proc is not mounted? I will possibly try to send a patch to Linux kernel to make /proc/self/exe available even if /proc is not mounted. Busybox docs contain (or contained in the past) such patch, so theoretically I can simply copy it from there. I can try to find a link
    1. 1
      
      andyc 1 year ago | link
      
      BTW my reading of this thread above
      
      https://lobste.rs/s/smbsd5/fork_road#c_qlextq
      
      is that if you want a shell to have job control (and POSIX requires job control), you should probably use fork() when it’s on.
      
      Also I’m interested in whether posix_spawn() is the best way to port a shell to Windows or not …. not sure what APIs bash uses on Windows.
      
      Is posix_spawn() built on top of Win32? How do they do descriptors, etc. ?
  6. 1
    safinaskar edited 1 year ago | link
    
    Also, anything like OCI / Docker / systemd-nspawn does a bunch of stuff between fork() and exec()
    
    Surprisingly, we can combine posix_spawn speed with the needs of Docker and systemd-nspawn. Let me tell you how.
    
    First of all, note that fork alone is not enough for systemd-nspawn. You also need to put the child in new mount/UTS/etc. namespaces. For this you need clone or unshare(2).
    
    Fortunately, clone has flag CLONE_VFORK, which allows us to get vfork-like behavior, i. e. our program will be faster than with fork.
    
    So, to summarize, we have ~~two options~~ one option to combine posix_spawn speed with features systemd-nspawn needs ~~(either one will be enough)~~:
    
    Use clone with all namespacing flags we need (such as CLONE_NEWNS) and CLONE_VFORK
    
    ~~Create new process using usual posix_spawn and then call unshare to put the process into new namespace~~
    
    I haven’t tested any of this, so it’s possible something will go wrong.
    
    Also: I’m author of my own unshare(1) util analog (don’t confuse with unshare(2) syscall). My util doesn’t do any speed up tricks, I just use plain clone without CLONE_VFORK
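The clone + CLONE_VFORK idea can be sketched as follows. This is a Linux-only illustration under assumptions, not code from asjail: `spawn_with_clone_vfork` and `exec_child` are hypothetical names, and `extra_flags` is where namespace flags such as CLONE_NEWNS or CLONE_NEWUTS would go (those normally need privileges or a user namespace). Note that plain vfork() corresponds to CLONE_VM | CLONE_VFORK; this sketch leaves out CLONE_VM so the child works on its own copy of memory, which is safer but gives up the part of vfork's speed that comes from sharing the address space.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Runs in the child: replace it with the requested program. */
static int exec_child(void *arg)
{
    char *const *argv = arg;
    execvp(argv[0], argv);
    _exit(127);                      /* exec failed */
}

/* Hypothetical helper: start argv[0] with clone(CLONE_VFORK | extra_flags)
 * and wait for it. Returns the exit status, or -1 on error. */
int spawn_with_clone_vfork(char *const argv[], int extra_flags)
{
    const size_t stack_size = 128 * 1024;
    char *stack;
    pid_t pid;
    int status;

    assert(argv != NULL && argv[0] != NULL);
    stack = malloc(stack_size);
    if (stack == NULL)
        return -1;
    /* The stack grows down, so pass the top. With CLONE_VFORK the parent
     * is suspended until the child has called exec or exited. */
    pid = clone(exec_child, stack + stack_size,
                CLONE_VFORK | extra_flags | SIGCHLD, (void *)argv);
    free(stack);                     /* the child has its own copy (no CLONE_VM) */
    if (pid < 0)
        return -1;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

With `extra_flags` set to 0 this behaves like an ordinary spawn-and-wait; any setup that must happen inside the new namespaces would go in `exec_child` before the execvp() call.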
  7. 1
    
    safinaskar 1 year ago | link
    
    Also, I wrote a Rust program, which spawns program using posix_spawn, redirects its in, out and err using posix_spawn_file_actions_adddup2, collects its out and err, waits for finish and reports status
2. 2
  
  fanf 1 year ago | link
  
  I was reminded of this by the discussion about problems with unix process APIs. In general I like the idea of a capability-oriented process API, which is basically what pidfd is. Then if process-manipulating syscalls require a process capability, operating on other processes is just like operating on your own, which also solves the issue of awkwardly huge spawn APIs, as in section 6 “Replacing fork - Low-level: Cross-process operations.”
  
  I think there’s a contradiction in this paper, tho: I think in principle if it’s possible to explicitly donate a share of parts of the current process’s state to a new empty process, then it’s possible to implement fork in userland, which is basically what that subsection of section 6 says. But that contradicts section 5 “Implementing fork - Fork infects an entire system” which claims “an abstraction, fork fails to compose: unless every layer supports fork, it cannot be used.”
  1. 1
    
    david_chisnall 1 year ago | link
    
    I don’t think there’s a contradiction. A userland fork would also not compose with other things, but having a feature in one library that doesn’t compose with features in other libraries is not a new problem: it’s inherent to the UNIX shared library model (and a regression from MULTICS).
  2. 1
    
    safinaskar 1 year ago | link
    
    basically what pidfd is
    
    Modern versions of Linux’s clone syscall accept the CLONE_PIDFD flag, which lets you get a pidfd immediately!
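A small sketch of what that looks like through glibc's clone() wrapper, assuming a kernel new enough to support CLONE_PIDFD (5.2 or later). `exit_status_via_pidfd` and `exit_child` are hypothetical names; the fallback `#define` is only there for older libc headers. The pidfd becomes readable when the child terminates, so it can be waited on with poll() like any other descriptor.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <poll.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef CLONE_PIDFD
#define CLONE_PIDFD 0x00001000       /* value from linux/sched.h */
#endif

/* Runs in the child: exit with the code the parent asked for. */
static int exit_child(void *arg)
{
    _exit(*(int *)arg);
}

/* Hypothetical helper: create a child with CLONE_PIDFD, wait for it by
 * polling the pidfd, then reap it. Returns its exit status, or -1. */
int exit_status_via_pidfd(int exit_code)
{
    const size_t stack_size = 64 * 1024;
    char *stack;
    int pidfd = -1, status, ready;
    pid_t pid;
    struct pollfd pfd;

    assert(exit_code >= 0 && exit_code <= 255);
    stack = malloc(stack_size);
    if (stack == NULL)
        return -1;
    /* With CLONE_PIDFD the kernel stores the new pidfd through the
     * parent_tid argument of clone(). */
    pid = clone(exit_child, stack + stack_size, CLONE_PIDFD | SIGCHLD,
                &exit_code, &pidfd);
    free(stack);
    if (pid < 0)
        return -1;                   /* e.g. EINVAL on kernels without CLONE_PIDFD */

    pfd.fd = pidfd;
    pfd.events = POLLIN;
    pfd.revents = 0;
    ready = poll(&pfd, 1, -1);       /* readable once the child has terminated */
    close(pidfd);

    if (waitpid(pid, &status, 0) != pid)
        return -1;
    if (ready != 1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Because the descriptor refers to one specific process, it avoids the PID-reuse races that come with passing bare PIDs around.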
  3. 1
    
    safinaskar 1 year ago | link
    
    operating on other processes is just like operating on your own, which also solves the issue of awkwardly huge spawn APIs, as in section 6 “Replacing fork - Low-level: Cross-process operations.”
    
    This problem is solved using io_uring_spawn (which is AFAIK not mainline yet). See https://lwn.net/Articles/908268/ and my comment below: https://lobste.rs/s/smbsd5/fork_road#c_vf4hnp
3. 2
  
  aidancully 1 year ago | link
  
  I’m excited to see if io_uring_spawn gets momentum. (https://lwn.net/Articles/908268/)
  1. 3
    
    safinaskar 1 year ago | link
    
    io_uring_spawn is very good thing!!
    
    If you like this article (“fork in the road”), you will like io_uring_spawn, too. I think io_uring_spawn solves all fork problems, while being faster than all other solutions, even faster than vfork and posix_spawn. Slides above include table with time comparison.
    
    Also, “fork in the road” includes this statement:
    
    …clean-slate designs [e.g., 40, 43] have demonstrated an alternative model where system calls that modify per-process state are not constrained to merely the current process, but rather can manipulate any process to which the caller has access. This yields the flexibility and orthogonality of the fork/exec model, without most of its drawbacks: a new process starts as an empty address space, and an advanced user may manipulate it in a piecemeal fashion, populating its address-space and kernel context prior to execution, without needing to clone the parent nor run code in the context of the child. ExOS [43] implemented fork in user-mode atop such a primitive. Retrofitting cross-process APIs into Unix seems at first glance challenging, but may also be productive for future research.
    
    io_uring_spawn is exactly this design!!!
4. 2
  safinaskar 1 year ago | link
  Zillions of C functions are incompatible with fork, including… printf!!!!! Consider this code:
  
  #include <stdio.h>
  #include <unistd.h>
  int main () { printf ("Hello"); fork (); }
  
  This code prints “Hello” twice on my machine (Linux x86_64), because the unflushed userspace stdio buffer is duplicated by fork() and then flushed by both processes when they exit.
  1. 1
    
    andyc 1 year ago | link
    
    Yup, shells have to be careful to flush() at specific points to avoid this … I definitely hit those bugs in both Python and C++
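The duplicated-buffer effect and the flush-before-fork fix can be made measurable with a temporary file instead of the terminal. This is a sketch under assumptions: `hello_then_fork` is a hypothetical helper, and tmpfile() stands in for stdout so the result can be counted in bytes.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: write "Hello" to a buffered temp file, optionally
 * flush, then fork. Returns how many bytes end up in the file, or -1. */
long hello_then_fork(int flush_first)
{
    FILE *out = tmpfile();           /* fully buffered stdio stream */
    pid_t pid;
    long size;

    assert(out != NULL);             /* demo-only error handling */
    fputs("Hello", out);             /* sits in the userspace buffer */
    if (flush_first)
        fflush(out);                 /* buffer is empty before fork() copies it */

    pid = fork();
    if (pid < 0) {
        fclose(out);
        return -1;
    }
    if (pid == 0)
        exit(0);                     /* exit() flushes the child's copy of the buffer */
    waitpid(pid, NULL, 0);

    fflush(out);                     /* parent flushes its own copy */
    fseek(out, 0, SEEK_END);
    size = ftell(out);
    fclose(out);
    return size;
}
```

Without the flush, both processes write their copy of the five buffered bytes; with it, the text is written once before the fork. Having the child call _exit() instead of exit() is another common way to avoid the duplicate, since _exit() skips the stdio flush.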
5. 2
  
  safinaskar 1 year ago | link
  
  The article mentions (IIRC) that fork interacts badly with sysctl vm.overcommit_memory=2 on Linux. This is very bad, because sysctl vm.overcommit_memory=2 is what you (from perfectionist view) should do always (unfortunately, this breaks some software in real world). You can read about overcommit here: https://ewontfix.com/3/
6. 1
  geocar edited 1 year ago | link
  This paper reads like sour grapes. I think fork() is great, but maybe some other metaphors would be great as well.
  
  Once upon a time, I was working on a DNS server, and after finding it a bit slow, did a:
  
  fork(),fork();
  
  after binding the sockets and immediately got it running multicore. Yes, I could have rewritten things to allow main() to accept the socket to share, or to have a server process pass out threads, and then I could have used spawn, but I have a hard time being mad at those 14 bytes.
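To see why those 14 bytes yield four workers, here is a small counting sketch. It is an illustration only: `count_after_double_fork` is a hypothetical name, and a pipe created before the forks stands in for the bound socket, since every process inherits it in the same way.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run fork(),fork() and return how many processes
 * exist afterwards, as counted by the original process. */
int count_after_double_fork(void)
{
    int p[2], rc, count = 0;
    char byte;
    pid_t origin = getpid();

    rc = pipe(p);                    /* stands in for the socket bound before forking */
    assert(rc == 0);
    (void)rc;

    fork(), fork();                  /* 1 -> 2 -> 4 processes, all sharing p */

    /* Every process announces itself once on the shared descriptor. */
    if (write(p[1], "x", 1) != 1)
        _exit(1);
    close(p[1]);
    if (getpid() != origin)
        _exit(0);                    /* workers are done in this sketch */

    /* Only the original gets here: count announcements until EOF. */
    while (read(p[0], &byte, 1) == 1)
        count++;
    close(p[0]);
    while (wait(NULL) > 0)
        ;                            /* reap the direct children */
    return count;
}
```

In the DNS server case each of the four processes would instead go on to serve requests from the inherited socket, with the kernel distributing incoming packets among them.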
7. 1
  
  codeslinger 1 year ago | link
  
  Curious that no one has yet mentioned that the Fuchsia developers were apparently fans of this position, as that OS lacks fork/exec: https://fuchsia.dev/fuchsia-src/concepts/kernel/libc#fork_and_exec
8. 0
  
  donio 1 year ago | link
  
  fork() is the only perfect syscall, it’s everything else that sucks.
