The "too small to fail" memory-allocation rule
Posted Dec 24, 2014 7:53 UTC (Wed) by ibukanov (guest, #3942)
In reply to: The "too small to fail" memory-allocation rule by zblaxell
Parent article: The "too small to fail" memory-allocation rule
For example, Linux allows a process to fork even if its image occupies more than half the memory on the system. The assumption is that the child will most likely either call exec soon, or that parent and child will use the shared memory in a read-only way. Now consider what should happen when the processes start writing to that memory: it may lead to OOM errors. Who should be blamed for it, the child or the parent? In addition, applications are typically not prepared to deal with a situation where an arbitrary write may trigger OOM. So the OOM killer is a pretty reasonable solution.
Compare that with Windows, where overcommit is not available (the lack of fork in the Windows API makes that possible). Memory accounting immediately becomes simple, and the OS can guarantee that if a process has a pointer to writable memory, the memory is always available for writing.
Posted Dec 24, 2014 9:23 UTC (Wed) by epa (subscriber, #39769)
Essentially, when you fork() the kernel has no way of knowing that you are just about to call exec() immediately afterwards, so it has to either reserve space for a complete copy of all your process's memory, or end up overcommitting and resorting to an OOM killer if it guessed wrong. If userland can give the kernel a bit more information then the kernel can do its job better.
Posted Dec 24, 2014 9:35 UTC (Wed) by roc (subscriber, #30627)
Also, copy-on-write fork()ing has other valuable uses. In rr we use it to create checkpoints very efficiently.
Posted Dec 24, 2014 9:54 UTC (Wed) by epa (subscriber, #39769)
Posted Dec 25, 2014 18:44 UTC (Thu) by quotemstr (subscriber, #45331)
vfork does what you want. The "child" shares memory with the parent until it calls exec, so you not only avoid the commit-charge catastrophe of copy-on-write fork, but also gain a significant performance boost from not having to copy the page tables.
Posted Dec 25, 2014 22:18 UTC (Thu) by ldo (guest, #40946)
Posted Dec 26, 2014 7:33 UTC (Fri) by epa (subscriber, #39769)
vfork() doesn't make a copy of the address space and so doesn't require either over-caution or over-committing. But it has other limitations.
Posted Dec 24, 2014 10:18 UTC (Wed) by ibukanov (guest, #3942)
To customize the child before exec, one does not need a full fork; vfork is enough, and it does not require overcommit.
Posted Dec 24, 2014 11:34 UTC (Wed) by pbonzini (subscriber, #60935)
Posted Dec 24, 2014 12:31 UTC (Wed) by ibukanov (guest, #3942)
Posted Dec 30, 2014 0:04 UTC (Tue) by klossner (subscriber, #30046)
Posted Dec 24, 2014 14:27 UTC (Wed) by justincormack (subscriber, #70439)
Posted Dec 24, 2014 14:12 UTC (Wed) by patrakov (subscriber, #97174)
"overcommit is not available (lack of fork in Windows API allows for that)"
It is not the lack of fork() that makes overcommit "unneeded", but an explicit design decision. The Windows MapViewOfFile() API allows creating copy-on-write mappings (which are usually the basis for overcommit) through the FILE_MAP_COPY flag. As documented, though, Windows always reserves the whole size of the memory region in the pagefile.
See http://msdn.microsoft.com/en-us/library/windows/desktop/a...
Posted Dec 24, 2014 15:33 UTC (Wed) by dumain (subscriber, #82016)
Posted Dec 25, 2014 18:50 UTC (Thu) by quotemstr (subscriber, #45331)
MAP_NORESERVE being ignored when overcommit=2 isn't a "gotcha". It's fundamental brokenness.
[1] https://www.kernel.org/doc/Documentation/vm/overcommit-ac...
Posted Dec 25, 2014 19:04 UTC (Thu) by fw (subscriber, #26023)
MAP_NORESERVE is not named appropriately, but it does what it does. The problem is that programmers do not know about overcommit mode 2, do not test with it, and hence never realize the need for the PROT_NONE approach.
Posted Dec 25, 2014 19:58 UTC (Thu) by quotemstr (subscriber, #45331)
Sadly, the commit-charge implications of MAP_NORESERVE are documented but silently broken, while the commit-charge implications of PROT_NONE are undocumented and in theory mutable in future releases. All this gives mmap(2) a Rusty Russell score of around -4.
Posted Dec 25, 2014 20:14 UTC (Thu) by fw (subscriber, #26023)
Posted Dec 25, 2014 20:23 UTC (Thu) by quotemstr (subscriber, #45331)
There's just tremendous confusion over exactly what Linux does with commit charge, even among people well-versed in memory-management fundamentals. There's no clarity in the API either. MAP_NORESERVE is dangerous because it's the only publicly documented knob for managing commit charge, and it's a broken knob nobody is interested in fixing.
(And no, SIGSEGV is not something that will discourage use. Those who know what they're doing can write code that recovers perfectly safely from SIGSEGV.)
Posted Apr 18, 2019 6:36 UTC (Thu) by thestinger (guest, #91827)
The linux-man-pages project is not the official kernel documentation and has often been inaccurate about important things for many years. MAP_NORESERVE doesn't work as it describes at all.
It has no impact in the full overcommit mode or the memory accounting mode, which makes sense. It only has an impact in the heuristic overcommit mode, where it disables the heuristic for causing immediate failure.
Note how a read-only (or PROT_NONE) mapping with no data has no accounting cost:
For an anonymous or /dev/zero map
Posted Dec 24, 2014 22:24 UTC (Wed) by Cyberax (✭ supporter ✭, #52523)
1) Map a file with copy-on-write semantics from several processes. Windows does this just fine with the MapViewOfFile function.
2) Reserve a large amount of virtual RAM but don't actually allocate physical pages for it. It's also supported with MEM_RESERVE flag for the VirtualAllocEx function ( http://msdn.microsoft.com/en-us/library/windows/desktop/a... )
The main distinction is that Windows programs by default try to avoid OOM situations.
Posted Dec 24, 2014 23:32 UTC (Wed) by ibukanov (guest, #3942)
On Windows, after MEM_RESERVE the memory is not available; one has to explicitly commit it using MEM_COMMIT. As for copy-on-write mmap'd files, Windows explicitly reserves the allocated memory in advance. In both cases Windows ensures that as long as a program has a pointer to writable memory, that memory is always available. That is, on Windows an OOM failure can only happen during allocation calls, not during an arbitrary write in a program, as it can on Linux.
Posted Dec 25, 2014 19:17 UTC (Thu) by quotemstr (subscriber, #45331)
In NT, the operations are separate. A program can set aside N pages of its address space, but it's only when it commits some of those pages that the kernel guarantees that it will be able to provide M, M<=N, distinct pages of information. After a commit operation succeeds, the system guarantees that it will be able to provide the requested pages. There is no OOM killer because the kernel can never work itself into a position where one might be necessary. While it's true that a process can reserve more address space than memory exists on the system, in order to use that memory, it must first make a commit system call, and *that call* can fail. That's not overcommit. That's sane, strict accounting. Your second point is based on a misunderstanding.
Your first point is also based on a misunderstanding. If two processes have mapped writable sections of a file, these mappings are either shared or private (and copy-on-write). Shared mappings do not incur a commit charge overhead because the pages in file-backed shared mappings are backed by the mapped files themselves. Private mappings are copy-on-write, but the entire commit charge for a copy-on-write mapping is assessed *at the time the mapping is created*, and operations that create file mappings can fail. Once they succeed, COW mappings are as committed as any other private, pagefile-backed mapping of equal size. Again, no overcommit. Just regular strict accounting.
From MSDN:
"When copy-on-write access is specified, the system and process commit charge taken is for the entire view because the calling process can potentially write to every page in the view, making all pages private. The contents of the new page are never written back to the original file and are lost when the view is unmapped."
The key problem with the Linux scheme is that, in NT terms, all mappings are SEC_RESERVE and are made SEC_COMMIT lazily on first access, and the penalty for a failed commit operation is sudden death of your process or a randomly chosen other process on the system. IMHO, Linux gets all this tragically wrong, and NT gets it right.
[1] Yes, MAP_NORESERVE exists. Few people use it. Why bother? It's broken anyway, especially with overcommit off, which is when you most care about MAP_NORESERVE in the first place!
[2] Sure, the OOM killer might run in response to a page fault, but the result will either be the death of some *other* process or the death of the process performing the page fault. Either way, that process never observes a failure, in the latter case because it's dead already. Let's ignore file-backed memory on volatile media too.
That issue was solved a long time ago, which is why the vfork(2) hack is obsolete nowadays.
Re: shouldn't require a complete copy of the parent process's memory
The man page says this about MAP_NORESERVE:
Do not reserve swap space for this mapping. When swap space is reserved,
one has the guarantee that it is possible to modify the mapping. When swap
space is not reserved one might get SIGSEGV upon a write if no physical
memory is available. See also the discussion of the file
/proc/sys/vm/overcommit_memory in proc(5). In kernels before 2.6, this
flag had effect only for private writable mappings.
That looks a lot like a flag for a mapping that doesn't reserve commit charge. There's no mention of the flag not actually working; there's no reference to better techniques. MAP_NORESERVE is just an attractive nuisance.
SHARED - size of mapping
PRIVATE READ-only - 0 cost (but of little use)
PRIVATE WRITABLE - size of mapping per instance