
The "too small to fail" memory-allocation rule

The "too small to fail" memory-allocation rule

Posted Dec 24, 2014 7:53 UTC (Wed) by ibukanov (guest, #3942)
In reply to: The "too small to fail" memory-allocation rule by zblaxell
Parent article: The "too small to fail" memory-allocation rule

The OOM killer is a consequence of memory overcommit, which in turn originated with the copy-on-write semantics of fork().

For example, Linux allows a process to fork even if its image occupies more than half of the memory on the system. The assumption is that the child will most likely either call exec() soon, or the parent and child will use the allocated memory in a read-only way. Now consider what should happen when the processes start to write to that memory: it may lead to OOM errors. Who should be blamed for them, the child or the parent? In addition, applications are typically not prepared to deal with a situation in which an arbitrary write may trigger OOM. So the OOM killer is a pretty reasonable solution.
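
A minimal C sketch of the scenario just described (the 1 GiB figure is an arbitrary placeholder, not from the comment): the fork() itself is cheap because pages are shared copy-on-write, and the real memory demand only appears when the child starts writing.

    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        size_t size = 1UL << 30;      /* 1 GiB -- "more than half" on a small box */
        char *big = malloc(size);
        if (!big)
            return 1;
        memset(big, 1, size);         /* touch the pages so they really exist */

        pid_t pid = fork();           /* cheap: pages are shared copy-on-write */
        if (pid == 0) {
            /* Each write below forces the kernel to materialize a private copy
             * of a page.  If memory runs out at this point there is no error
             * to return to the program -- the OOM killer picks a victim. */
            memset(big, 2, size);
            _exit(0);
        }
        return 0;
    }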

Compare that with Windows, where overcommit is not available (the lack of fork() in the Windows API allows for that). Memory accounting immediately becomes simple, and the OS can guarantee that if a process has a pointer to writable memory, then that memory is always available for writing.



The "too small to fail" memory-allocation rule

Posted Dec 24, 2014 9:23 UTC (Wed) by epa (subscriber, #39769) [Link] (10 responses)

fork-exec is one of the aspects of classical Unix which looks really neat in a textbook but hasn't held up quite so well in the real world. Remembering that 'things should be as simple as possible, but no simpler', it is just a bit too simple. POSIX does define spawn functions which userspace code might call instead.

Essentially, when you fork() the kernel has no way of knowing that you are just about to call exec() immediately afterwards, so it has to either reserve space for a complete copy of all your process's memory, or end up overcommitting and resorting to an OOM killer if it guessed wrong. If userland can give the kernel a bit more information then the kernel can do its job better.
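
As a point of reference, here is a hedged sketch of the posix_spawn() interface mentioned above (the spawned program is just a placeholder): the caller states "create a process running this program" in one call, so no full copy of the parent's address space is implied.

    #include <spawn.h>
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    extern char **environ;

    int main(void)
    {
        pid_t pid;
        char *argv[] = { "ls", "-l", NULL };

        /* posix_spawn() combines the fork-and-exec intent in one operation,
         * so the kernel/C library never has to account for a duplicate of
         * the parent's whole address space. */
        int err = posix_spawn(&pid, "/bin/ls", NULL, NULL, argv, environ);
        if (err != 0) {
            fprintf(stderr, "posix_spawn: error %d\n", err);
            return 1;
        }
        waitpid(pid, NULL, 0);
        return 0;
    }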

The "too small to fail" memory-allocation rule

Posted Dec 24, 2014 9:35 UTC (Wed) by roc (subscriber, #30627) [Link] (9 responses)

One great feature of the fork()/exec() model is that you can set up the child's environment by running code in the child before exec --- setting resource limits, manipulating file descriptors, dropping privileges, etc. It's easy to do this in a race-free way. With Windows' CreateProcess you can't do this, so it takes 10 parameters including a flags word with 16 flags and a STARTUPINFO struct with over a dozen members, and it's still less flexible than fork()/exec().
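
A rough illustration of that pattern (the helper name, limits, and log file are arbitrary, not taken from the comment): the setup code runs in the child, so it cannot race with other threads in the parent.

    #include <fcntl.h>
    #include <sys/resource.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* fork()/exec() with per-child setup done, race-free, in the child. */
    static pid_t spawn_logged(const char *prog, char *const argv[], const char *logfile)
    {
        pid_t pid = fork();
        if (pid == 0) {
            /* Child: limits, fd juggling, privilege dropping, etc. go here;
             * it only affects this process. */
            struct rlimit lim = { .rlim_cur = 1 << 20, .rlim_max = 1 << 20 };
            setrlimit(RLIMIT_FSIZE, &lim);             /* resource limit  */

            int fd = open(logfile, O_WRONLY | O_CREAT | O_APPEND, 0644);
            if (fd >= 0) {
                dup2(fd, STDOUT_FILENO);               /* redirect output */
                dup2(fd, STDERR_FILENO);
                close(fd);
            }
            execvp(prog, argv);
            _exit(127);                                /* exec failed     */
        }
        return pid;                                    /* parent, or -1   */
    }

    int main(void)
    {
        char *argv[] = { "echo", "hello", NULL };
        pid_t pid = spawn_logged("echo", argv, "child.log");
        if (pid > 0)
            waitpid(pid, NULL, 0);
        return 0;
    }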

Also, copy-on-write fork()ing has other valuable uses. In rr we use it to create checkpoints very efficiently.

The "too small to fail" memory-allocation rule

Posted Dec 24, 2014 9:54 UTC (Wed) by epa (subscriber, #39769) [Link] (3 responses)

This is true. But the manipulation of file descriptors shouldn't require a complete copy of the parent process's memory. It would be handy if you could fork while passing a buffer of memory and a function pointer; the child process would be given a copy of that buffer and would jump to the given function. Then you would have some flexibility in what you can do before exec(), while still being frugal with memory and not relying on overcommit.
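
Linux's clone(2) comes fairly close to this idea: the caller supplies a function pointer and a small private stack, and with CLONE_VM the child shares (rather than copies) the parent's memory. A hedged sketch under those assumptions, not a complete substitute for what is proposed above:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* Child entry point: runs on its own small stack, shares the parent's
     * memory, does its setup, and execs.  No full address-space copy (and
     * no matching commit charge) is required. */
    static int child_main(void *arg)
    {
        char **argv = arg;
        /* ...file-descriptor setup could go here... */
        execvp(argv[0], argv);
        _exit(127);
    }

    int main(void)
    {
        char *argv[] = { "true", NULL };
        size_t stack_size = 64 * 1024;
        char *stack = malloc(stack_size);
        if (!stack)
            return 1;

        /* CLONE_VFORK suspends the parent until the child execs or exits,
         * so the shared memory is not modified concurrently by both. */
        pid_t pid = clone(child_main, stack + stack_size,
                          CLONE_VM | CLONE_VFORK | SIGCHLD, argv);
        if (pid < 0)
            return 1;
        waitpid(pid, NULL, 0);
        free(stack);
        return 0;
    }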

The "too small to fail" memory-allocation rule

Posted Dec 25, 2014 18:44 UTC (Thu) by quotemstr (subscriber, #45331) [Link]

> The child process will be given a copy of that buffer and will jump to the function given

vfork does what you want. The "child" shares memory with the parent until it calls exec, so you avoid not only the commit charge catastrophe of copy-on-write fork, but also gain a significant performance boost from not having to copy the page tables.
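
A minimal sketch of that pattern (the exec'd program is a placeholder); per POSIX the child may only call an exec function or _exit() after vfork():

    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        pid_t pid = vfork();
        if (pid == 0) {
            /* Child borrows the parent's memory while the parent is
             * suspended, so only exec*() or _exit() are allowed here. */
            execlp("true", "true", (char *)NULL);
            _exit(127);                /* reached only if exec failed */
        } else if (pid > 0) {
            int status;
            /* Parent resumed once the child exec'd; now wait for it. */
            waitpid(pid, &status, 0);
        }
        return 0;
    }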

Re: shouldn't require a complete copy of the parent process's memory

Posted Dec 25, 2014 22:18 UTC (Thu) by ldo (guest, #40946) [Link] (1 responses)

That issue was solved a long time ago, which is why the vfork(2) hack is obsolete nowadays.

Re: shouldn't require a complete copy of the parent process's memory

Posted Dec 26, 2014 7:33 UTC (Fri) by epa (subscriber, #39769) [Link]

Sorry what issue are you referring to? The conjecture so far is that there is a problem, since a forked process gets a copy of its parent's address space, which requires either reserving enough memory (RAM or swap) to provide that, or else overcommitting and crossing your fingers (with OOM killer for when you get it wrong). It is true that copy-on-write means the kernel doesn't have to copy all the pages straight away, but that is just a time saving; it doesn't resolve the underlying issue of needing overcommit.

vfork() doesn't make a copy of the address space and so doesn't require either over-caution or over-committing. But it has other limitations.

The "too small to fail" memory-allocation rule

Posted Dec 24, 2014 10:18 UTC (Wed) by ibukanov (guest, #3942) [Link] (4 responses)

> One great feature of the fork()/exec() model is that you can set up the child's environment by running code in the child before exec

To customize the child before exec one does not need a full fork; vfork is enough and does not require overcommit.

The "too small to fail" memory-allocation rule

Posted Dec 24, 2014 11:34 UTC (Wed) by pbonzini (subscriber, #60935) [Link] (3 responses)

Only on Linux. The POSIX description is "you can store the return value from vfork(), then you have to call _exit or exec".

The "too small to fail" memory-allocation rule

Posted Dec 24, 2014 12:31 UTC (Wed) by ibukanov (guest, #3942) [Link] (1 responses)

The Linux semantics of vfork() are among the most usable, as it blocks only the thread in the parent that calls it. Thus one can get a rather safe and efficient alternative to fork/exec by creating a new thread, calling vfork() from it, preparing the child, and calling exec(). Compare that with, say, FreeBSD or Solaris, which, according to the manual pages, suspend the whole parent process during the vfork() call.
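
A hedged sketch of that approach (error handling elided, program name is a placeholder); on Linux, only the spawner thread is blocked while the child prepares itself and execs:

    #include <pthread.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* Spawner thread: vfork() here blocks only this thread (Linux
     * behaviour), so the rest of the parent process keeps running. */
    static void *spawner(void *arg)
    {
        char **argv = arg;
        pid_t pid = vfork();
        if (pid == 0) {
            /* "Preparing the child": e.g. redirect stdout, then exec. */
            /* dup2(some_fd, STDOUT_FILENO); */
            execvp(argv[0], argv);
            _exit(127);
        }
        waitpid(pid, NULL, 0);
        return NULL;
    }

    int main(void)
    {
        char *argv[] = { "true", NULL };
        pthread_t t;
        pthread_create(&t, NULL, spawner, argv);
        /* ...the main thread is free to keep working here... */
        pthread_join(t, NULL);
        return 0;
    }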

The "too small to fail" memory-allocation rule

Posted Dec 30, 2014 0:04 UTC (Tue) by klossner (subscriber, #30046) [Link]

By "create a new thread" do you mean call pthread_create? That just ends up in do_fork(), which creates a new process with which to call vfork(), which ends up in do_fork() to create another new process. Safe, okay, but not really efficient.

The "too small to fail" memory-allocation rule

Posted Dec 24, 2014 14:27 UTC (Wed) by justincormack (subscriber, #70439) [Link]

Most versions allow you to do more than this, not just the Linux one. That makes it all rather non-portable, though.

The "too small to fail" memory-allocation rule

Posted Dec 24, 2014 14:12 UTC (Wed) by patrakov (subscriber, #97174) [Link]

This phrase is not 100% accurate:

"overcommit is not available (lack of fork in Windows API allows for that)"

It is not the lack of fork() that makes overcommit "unneeded", but an explicit decision. The Windows MapViewOfFile() API allows the creation of copy-on-write mappings (which are usually the basis for overcommit) through the FILE_MAP_COPY flag. As documented, though, Windows always reserves the whole size of the memory region in the pagefile.
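
A hedged Windows sketch of such a copy-on-write view (the file name is a placeholder); as described above, the whole view is charged against the pagefile when it is mapped, so a later write cannot fail for lack of memory.

    #include <windows.h>

    int main(void)
    {
        HANDLE file = CreateFileA("data.bin", GENERIC_READ, FILE_SHARE_READ,
                                  NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
        if (file == INVALID_HANDLE_VALUE)
            return 1;

        /* PAGE_WRITECOPY + FILE_MAP_COPY give a private copy-on-write view;
         * the commit charge for the entire view is taken up front. */
        HANDLE mapping = CreateFileMappingA(file, NULL, PAGE_WRITECOPY, 0, 0, NULL);
        if (!mapping)
            return 1;

        char *view = MapViewOfFile(mapping, FILE_MAP_COPY, 0, 0, 0);
        if (!view)
            return 1;

        view[0] = 'x';   /* first write copies the page privately; no OOM here */

        UnmapViewOfFile(view);
        CloseHandle(mapping);
        CloseHandle(file);
        return 0;
    }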

See http://msdn.microsoft.com/en-us/library/windows/desktop/a...

The "too small to fail" memory-allocation rule

Posted Dec 24, 2014 15:33 UTC (Wed) by dumain (subscriber, #82016) [Link] (6 responses)

The copy-on-write fork semantics don't have to lead to the OOM killer. Selecting strategy 2 in /proc/sys/vm/overcommit_memory should allow a Linux box to overcommit real memory without overcommitting virtual memory. The penalty for actually using the memory then becomes swapping rather than random process death.

The "too small to fail" memory-allocation rule

Posted Dec 25, 2014 18:50 UTC (Thu) by quotemstr (subscriber, #45331) [Link] (5 responses)

overcommit=2 is not usable in practice on general-purpose systems: MAP_NORESERVE is ignored in that mode. Many programs (like the JVM) reserve large chunks of address space on startup and internally allocate out of that. Because the kernel ignores [1] MAP_NORESERVE, these address space reservations become actual *commit charge* claims on the system and require resource allocations far in excess of what's actually needed even ignoring COW fork considerations.

MAP_NORESERVE being ignored when overcommit=2 isn't a "gotcha". It's fundamental brokenness.

[1] https://www.kernel.org/doc/Documentation/vm/overcommit-ac...
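
A hedged illustration of the reservation pattern described above (the 1 GiB size is arbitrary); with overcommit mode 2 the whole region is charged at mmap() time regardless of the flag:

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t size = 1UL << 30;   /* 1 GiB of address space, JVM-heap style */

        /* The man page suggests this reserves no swap/commit charge, but with
         * /proc/sys/vm/overcommit_memory = 2 the flag is ignored and the full
         * gigabyte counts against the commit limit immediately. */
        void *arena = mmap(NULL, size, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        if (arena == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
        /* ...an internal allocator would hand out pieces of 'arena' here... */
        munmap(arena, size);
        return 0;
    }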

The "too small to fail" memory-allocation rule

Posted Dec 25, 2014 19:04 UTC (Thu) by fw (subscriber, #26023) [Link] (4 responses)

OpenJDK has been fixed and uses PROT_NONE mappings to reserve uncommitted address space (which is then committed with mprotect calls, as needed).

MAP_NORESERVE is not named appropriately, but it does what it does. The problem is that programmers do not know about overcommit mode 2, do not test with it, and hence never realize the need for the PROT_NONE approach.
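
A hedged sketch of that reserve-then-commit pattern (sizes arbitrary): the PROT_NONE region carries no commit charge even in mode 2, and mprotect() is the point where failure can be reported cleanly rather than via the OOM killer.

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t reserve = 1UL << 30;    /* 1 GiB of address space, uncommitted */
        size_t commit  = 1UL << 20;    /* commit 1 MiB of it when needed      */

        /* PROT_NONE anonymous memory carries no commit charge, even with
         * overcommit_memory = 2, so this only claims address space. */
        char *base = mmap(NULL, reserve, PROT_NONE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (base == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        /* Committing: making a piece writable is where the charge is taken,
         * and where a clean error can come back to the caller. */
        if (mprotect(base, commit, PROT_READ | PROT_WRITE) != 0) {
            perror("mprotect");
            return 1;
        }
        base[0] = 1;    /* safe: this page is now committed */
        munmap(base, reserve);
        return 0;
    }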

The "too small to fail" memory-allocation rule

Posted Dec 25, 2014 19:58 UTC (Thu) by quotemstr (subscriber, #45331) [Link] (3 responses)

Agreed on the lack of testing --- but what bugs me is that MAP_NORESERVE is documented as doing the right thing, and doesn't. That means that programs written by the few well-intentioned developers who are aware of overcommit issues frequently don't end up working correctly.

Sadly, the commit charge implications of MAP_NORESERVE are documented but silently broken, while the commit charge implications of PROT_NONE are undocumented and in theory mutable in future releases. All this gives mmap(2) a Rusty Russell score of around -4.

The "too small to fail" memory-allocation rule

Posted Dec 25, 2014 20:14 UTC (Thu) by fw (subscriber, #26023) [Link] (2 responses)

Which documentation? The manual page talks about error reporting through SIGSEGV, which should be sufficiently discouraging to anyone who actually reads it.

The "too small to fail" memory-allocation rule

Posted Dec 25, 2014 20:23 UTC (Thu) by quotemstr (subscriber, #45331) [Link] (1 responses)

The man page says this about MAP_NORESERVE:
Do not reserve swap space for this mapping. When swap space is reserved, one has the guarantee that it is possible to modify the mapping. When swap space is not reserved one might get SIGSEGV upon a write if no physical memory is available. See also the discussion of the file /proc/sys/vm/overcommit_memory in proc(5). In kernels before 2.6, this flag had effect only for private writable mappings.
That looks a lot like a flag for a mapping that doesn't reserve commit charge. There's no mention of the flag not actually working; there's no reference to better techniques. MAP_NORESERVE is just an attractive nuisance.

There's just tremendous confusion over exactly what Linux does with commit charge, even among people well-versed in memory-management fundamentals. There's no clarity in the API either. MAP_NORESERVE is dangerous because it's the only publicly documented knob for managing commit charge, and it's a broken knob nobody is interested in fixing.

(And no, SIGSEGV is not something that will discourage use. Those who know what they're doing can write code that recovers perfectly safely from SIGSEGV.)

The "too small to fail" memory-allocation rule

Posted Apr 18, 2019 6:36 UTC (Thu) by thestinger (guest, #91827) [Link]

In case anyone comes across this old comment, the official documentation is at https://www.kernel.org/doc/Documentation/vm/overcommit-ac... and is accurate. It covers the rules properly.

The linux-man-pages project is not the official kernel documentation and has often been inaccurate about important things for many years. MAP_NORESERVE doesn't work at all as the man page describes.

It has no impact in the full overcommit mode or the memory-accounting mode, which makes sense. It only has an impact in the heuristic overcommit mode, where it disables the heuristic check that would otherwise cause an immediate allocation failure.

Note how a read-only (or PROT_NONE) anonymous mapping has no accounting cost:

For an anonymous or /dev/zero map
SHARED - size of mapping
PRIVATE READ-only - 0 cost (but of little use)
PRIVATE WRITABLE - size of mapping per instance

The "too small to fail" memory-allocation rule

Posted Dec 24, 2014 22:24 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

That's not exactly true. Windows supports overcommit just fine. There are two main ways to overcommit:

1) Map a file with copy-on-write semantics from several processes. Windows does this just fine with the MapViewOfFile function.

2) Reserve a large amount of virtual RAM but don't actually allocate physical pages for it. That is also supported, with the MEM_RESERVE flag to the VirtualAllocEx function ( http://msdn.microsoft.com/en-us/library/windows/desktop/a... )

The main distinction is that Windows programs by default try to avoid OOM situations.

The "too small to fail" memory-allocation rule

Posted Dec 24, 2014 23:32 UTC (Wed) by ibukanov (guest, #3942) [Link]

These are not overcommit in the Linux sense.

On Windows, memory is not usable right after MEM_RESERVE; one has to explicitly commit it using MEM_COMMIT. As for copy-on-write mappings of files, Windows explicitly reserves the allocated memory in advance. In both cases Windows ensures that, as long as a program has a pointer to writable memory, that memory is always available. That is, on Windows an OOM failure can only happen during allocation calls, not during an arbitrary write in a program, as happens on Linux.
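
A hedged sketch of that reserve/commit split (sizes arbitrary, using plain VirtualAlloc rather than VirtualAllocEx); any possible failure is confined to the allocation calls.

    #include <windows.h>

    int main(void)
    {
        SIZE_T reserve = (SIZE_T)1 << 30;   /* 1 GiB of address space only   */
        SIZE_T commit  = (SIZE_T)1 << 20;   /* back 1 MiB of it with storage */

        /* MEM_RESERVE claims addresses but no memory; touching them faults. */
        char *base = VirtualAlloc(NULL, reserve, MEM_RESERVE, PAGE_NOACCESS);
        if (!base)
            return 1;

        /* MEM_COMMIT is where the commit charge is taken -- and where an
         * out-of-memory condition is reported as a NULL return, not a kill. */
        char *usable = VirtualAlloc(base, commit, MEM_COMMIT, PAGE_READWRITE);
        if (!usable)
            return 1;

        usable[0] = 'x';   /* guaranteed to succeed once the commit succeeded */

        VirtualFree(base, 0, MEM_RELEASE);
        return 0;
    }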

The "too small to fail" memory-allocation rule

Posted Dec 25, 2014 19:17 UTC (Thu) by quotemstr (subscriber, #45331) [Link]

Your comment is incorrect according to a conventional understanding of "overcommit". The misunderstanding probably stems from a general confusion, even among very senior Linux developers, between allocating address space and allocating memory. In Linux, programs call mmap and expect that operations on the memory returned will never[2] fail. There is no distinction in practice [1] between setting aside N pages of virtual addresses and reserving storage for N distinct pages of information.

In NT, the operations are separate. A program can set aside N pages of its address space, but it's only when it commits some of those pages that the kernel guarantees that it will be able to provide M, M<=N, distinct pages of information. After a commit operation succeeds, the system guarantees that it will be able to provide the requested pages. There is no OOM killer because the kernel can never work itself into a position where one might be necessary. While it's true that a process can reserve more address space than memory exists on the system, in order to use that memory, it must first make a commit system call, and *that call* can fail. That's not overcommit. That's sane, strict accounting. Your second point is based on a misunderstanding.

Your first point is also based on a misunderstanding. If two processes have mapped writable sections of a file, these mappings are either shared or private (and copy-on-write). Shared mappings do not incur a commit charge overhead because the pages in file-backed shared mappings are backed by the mapped files themselves. Private mappings are copy-on-write, but the entire commit charge for a copy-on-write mapping is assessed *at the time the mapping is created*, and operations that create file mappings can fail. Once they succeed, COW mappings are as committed as any other private, pagefile-backed mapping of equal size. Again, no overcommit. Just regular strict accounting.

From MSDN:

"When copy-on-write access is specified, the system and process commit charge taken is for the entire view because the calling process can potentially write to every page in the view, making all pages private. The contents of the new page are never written back to the original file and are lost when the view is unmapped."

The key problem with the Linux scheme is that, in NT terms, all mappings are SEC_RESERVE and are made SEC_COMMIT lazily on first access, and the penalty for a failed commit operation is sudden death of your process or a randomly chosen other process on the system. IMHO, Linux gets all this tragically wrong, and NT gets it right.

[1] Yes MAP_NORESERVE exists. Few people use it. Why bother? It's broken anyway, especially with overcommit off, when you most care about MAP_NORESERVE in the first place!

[2] Sure, the OOM killer might run in response to a page fault, but the result will either be the death of some *other* process or the death of the process performing the page fault. Either way, that process never observes a failure, in the latter case because it's dead already. Let's ignore file-backed memory on volatile media too.

