As a general thing, we pervasively trade memory allocation for speed throughout the software stack. Essentially every mainstream memory allocator rounds up memory allocation sizes or otherwise wastes space; most user level memory allocators request memory from the OS in large blocks to reduce system call overhead and don’t meticulously hand back every now-empty page; data structures like hash tables are not sized minimally and are often resized in large jumps; CPUs prefer or require alignment that leaves unused space in structures; etc etc. There are environments where we don’t do this and every byte is sacred, but they’re rare.
I was going to link this as further reading to https://lobste.rs/s/qyrsfx/practices_reliable_software_design#c_hmrnhd, for any reader of that comment unfamiliar with Linux overcommit, but I realized that it had never been submitted as a story, so I thought I’d do so.
I believe this behaviour dates back to VAX BSD, though I could be wrong. There are basically two choices possible: Windows makes one, and almost everyone else makes the other.
You can have a notion of commit charge, where every memory mapping requires accounting. This must also track kernel memory. The total available commit is the sum of memory and swap, and every page you map charges one page of commit. If the charge cannot be made, the mapping fails. This doesn’t require that you allocate pages immediately, it simply requires that you ensure that you are able to.
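To make the first model concrete, here is a minimal sketch in C of strict commit accounting (the names and the single global counter are invented for illustration; a real kernel such as NT tracks this per process and system-wide, kernel allocations included):

    /* Hedged sketch of strict commit accounting: a mapping is refused up
     * front unless the system can promise backing store for every page,
     * even though no physical pages are allocated yet. */
    #include <stddef.h>
    #include <stdio.h>

    static size_t commit_limit;   /* total pages of RAM + swap */
    static size_t commit_charged; /* pages promised so far */

    /* Record the charge and return 0, or return -1 if the mapping must fail. */
    int charge_commit(size_t npages) {
        if (commit_charged + npages > commit_limit)
            return -1;
        commit_charged += npages;
        return 0;
    }

    void release_commit(size_t npages) {
        commit_charged -= npages;
    }

    int main(void) {
        commit_limit = 1024;      /* pretend the machine has 1024 pages of RAM + swap */
        printf("map 512 pages: %s\n", charge_commit(512) == 0 ? "ok" : "refused");
        printf("map 768 pages: %s\n", charge_commit(768) == 0 ? "ok" : "refused");
        return 0;
    }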
Alternatively, you can allow any mapping to succeed and, when the pages are actually written, try to find some physical memory. If that fails (via whatever fallback paths you have, including swap, flushing caches, OOM killer, and so on) then you deliver a segfault or similar.
At first glance, option one looks better. In practice, there are a lot of reasons why this is not the case.
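For contrast, a small sketch of the second model, runnable on a 64-bit Linux or macOS machine (the 64 GiB figure is arbitrary, and Linux’s default heuristic overcommit may still refuse a truly absurd single request): the reservation can succeed regardless of how much RAM is present, and only the touched pages ever need physical backing.

    /* Reserve far more address space than RAM and touch only a sliver of it.
     * Under overcommit the mmap succeeds; pages are backed on first write. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 64UL << 30;                 /* 64 GiB of address space */
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
        memset(p, 1, 1UL << 20);                 /* dirty only 1 MiB of it */
        printf("reserved %zu GiB, touched 1 MiB\n", len >> 30);
        return 0;
    }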
First, it requires that all memory is fungible. The NT kernel goes to a lot of effort to make this almost true. Almost every page in the kernel can be paged out, even page-table pages. An invalid PTE has one bit marking it as not valid and all of the rest are then available for software use. These are used to encode the swap file / device and offset for paged-out pages. If you hit a not-present page during page-table walk, you will pull in the page-table pages, then the data page. Eventually you end up with the page. Page-table pages and everything else in the kernel need to be accounted to a process’s commit charge.
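As an illustration of the software-use-bits trick, here is a hedged sketch loosely modelled on x86-64, where only bit 0 (Present) matters to hardware once a PTE is not valid; the field layout below is invented and is not NT’s actual encoding:

    /* A not-present PTE repurposed to say where the page went: every bit
     * except the Present bit is free for the OS to use. */
    #include <stdint.h>
    #include <stdio.h>

    typedef uint64_t pte_t;
    #define PTE_PRESENT 0x1ULL

    /* Pack a swap device index and a page-granularity slot, keeping bit 0 clear. */
    static pte_t make_swapped_pte(uint64_t swap_dev, uint64_t swap_slot) {
        return (swap_dev << 58) | (swap_slot << 1);
    }

    static int      pte_is_present(pte_t p) { return (int)(p & PTE_PRESENT); }
    static uint64_t pte_swap_dev(pte_t p)   { return p >> 58; }
    static uint64_t pte_swap_slot(pte_t p)  { return (p >> 1) & ((1ULL << 57) - 1); }

    int main(void) {
        pte_t pte = make_swapped_pte(2, 0x1234);   /* swap device 2, slot 0x1234 */
        printf("present=%d dev=%llu slot=%#llx\n", pte_is_present(pte),
               (unsigned long long)pte_swap_dev(pte),
               (unsigned long long)pte_swap_slot(pte));
        return 0;
    }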
The accounting can be hard. It used to be (I have not tried it recently) possible to crash a Linux system by mmap()ing a 4 KiB page from the same file / shared memory object at every other page offset in your address space. The kernel would build enormous page tables and these would exhaust memory, so the OOM killer would start. Your process is using basically no memory, so it would keep killing other processes. It used to hit the X server and then init, but I think these are now explicitly excluded. I’m not sure what happens when the kernel actually exhausts memory.
The fungibility gets really hard with features like MTE or CHERI, which have some out-of-band metadata associated with memory. Now two pages of memory, one with MTE tags and another without, use different amounts of swap. That complicates accounting. It gets worse when you add in memory compression (most modern operating systems have a small pool of compressed pages, which avoids swapping, and then swap from that region so that the amount written to disk is reduced). This means that ten pages of memory may be the same size as two pages of disk, or ten pages of disk, depending on the data.
This amplifies the second problem: efficient use of resources. Ideally, you would be using 100% of your RAM at all times. Anything that isn’t used by applications can be used as extra disk cache. Without overcommit, allocations fail as soon as you exhaust the commit limit. Memory compression (or any other non-fungibility) means that you can exhaust commit before you exhaust memory. If a program allocates a large buffer but uses just a part of it, you will charge the whole thing. If a hundred processes do this, you’ll have a lot of wasted memory. The Windows desktop I had at MS had 128 GiB of RAM, and allocations would fail with 50+ GiB free until I configured 512 GiB of swap. You’re accounting for the worst case, not the common case, so in the common case you’ll be using the resources incredibly inefficiently. This then feeds the third problem, because you’re making memory exhaustion happen a lot earlier than it otherwise would.
The third problem is that memory allocation failures are really hard to handle gracefully. Embedded systems, where allocation failures are common, typically do handle them[1], but it’s very hard to do in more complex programs. In general, for reliability, making allocation failures rare is better than making them easy to handle, because almost no one actually handles them correctly.
And, it turns out, there are a lot of ways of handling memory exhaustion that work well at a system level. Flushing buffers back to disk (most operating systems unify swap with file caches these days, so any clean page can just be discarded and others can be written to disk) is a first step. XNU, Windows, and Android provide events from the kernel when memory is tight[2], so you can trigger GC, flush caches, and so on. The Android runtime, I believe, puts its GC in a more aggressive mode when it gets this event. On macOS / iOS, the libcache / NSCache infrastructure lets apps store objects that may be freed when memory is constrained. XNU goes one step further with Sudden Termination: processes can mark themselves as having no unsaved data, and the kernel can kill them and reap their memory with no further interaction. All of these mean that seeing an allocation failure is a very rare occurrence.
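As an example of consuming one of these events, here is a sketch using the libdispatch memory-pressure source on macOS / iOS; the drop_caches() callback is a placeholder, and Windows and Android expose comparable but differently shaped notifications:

    /* React to kernel memory-pressure notifications by shedding caches. */
    #include <dispatch/dispatch.h>
    #include <stdio.h>

    static void drop_caches(void) { /* e.g. empty an NSCache, nudge the GC */ }

    int main(void) {
        dispatch_source_t src = dispatch_source_create(
            DISPATCH_SOURCE_TYPE_MEMORYPRESSURE, 0,
            DISPATCH_MEMORYPRESSURE_WARN | DISPATCH_MEMORYPRESSURE_CRITICAL,
            dispatch_get_main_queue());
        dispatch_source_set_event_handler(src, ^{
            fprintf(stderr, "memory pressure level %lu, dropping caches\n",
                    (unsigned long)dispatch_source_get_data(src));
            drop_caches();
        });
        dispatch_resume(src);
        dispatch_main();   /* park the main thread and wait for events */
    }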
The concept of memory that you want to have but can easily get rid of if necessary doesn’t play nicely with a commit charge model at all. If RAM is plentiful, you may tweak your GC heuristics to run infrequently, but then collect more aggressively when memory is tight. How much memory should be committed? You don’t actually know until you’ve done the GC pass. If a browser has a 1 GiB cache in memory but can discard all of it when memory is low, how much should its commit charge be for this? What if it needs to allocate 1 MiB to be able to do the bookkeeping to discard that 1 GiB? If a process is in the Sudden Termination state, what should its commit charge be? What should it then be if it receives an event?
[1] Though we recently found a place in the FreeRTOS TCP stack where it didn’t, and this is MISRA C, which is supposed to make this impossible. Unfortunately, the allocation failure was in a function that propagated the error to the caller, and the caller cast the result to void to silence the static analyser warnings. Oops.
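The failure mode in [1] looks roughly like this (all names invented for illustration): the callee does the right thing, and the cast at the call site quietly defeats it.

    /* The callee propagates allocation failure correctly... */
    #include <stdlib.h>
    #include <string.h>

    int queue_packet(const void *data, size_t len) {
        void *copy = malloc(len);
        if (copy == NULL)
            return -1;                 /* allocation failed, tell the caller */
        memcpy(copy, data, len);
        /* ... enqueue copy for transmission ... */
        return 0;
    }

    /* ...and the caller throws the error away to silence the analyser. */
    void on_receive(const void *data, size_t len) {
        (void)queue_packet(data, len);
    }

    int main(void) {
        on_receive("ping", 4);         /* any failure inside is silently lost */
        return 0;
    }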
[2] The Windows ones, unfortunately, are useless, because they warn you when physical memory is low, not when commit is low, and so can fail to fire even when allocations are failing. They’re intended to let .NET run a GC instead of swapping.
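For reference, the Windows mechanism in [2] is the memory resource notification object; a minimal consumer looks something like the sketch below, and note that it is keyed to physical memory, not commit.

    /* Wait for the kernel's low-physical-memory notification. */
    #include <windows.h>
    #include <stdio.h>

    int main(void) {
        HANDLE low = CreateMemoryResourceNotification(LowMemoryResourceNotification);
        if (low == NULL) return 1;
        for (;;) {
            WaitForSingleObject(low, INFINITE);  /* signalled while RAM is scarce */
            fprintf(stderr, "physical memory low (commit may have run out long ago)\n");
            Sleep(5000);                         /* back off before re-checking */
        }
    }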
4.x BSD Unix had a fairly different behavior than modern systems. BSD fixed the size of kernel memory, including the disk cache, and it required every page of user memory to be backed by a page of swap. I believe the initial BSDs did not do copy-on-write for fork() (hence vfork()); my fallible memory remembers this as being due to hardware issues on the VAX 11/750 that made COW not work in some situations. This BSD behavior was very simple to implement and quite reliable, but it obviously wasn’t really great, especially once kernels started dynamically sizing things like the disk cache (which I believe arrived with mmap() and the unified buffer cache in SunOS 4), and Unix implementations soon moved to more sophisticated approaches.
(This 4.x BSD behavior is the ultimate source of various traditional pieces of advice about Unix swap sizing, like ‘2x your RAM’.)
Ah, thank you. In that case, the most likely source of this is the Mach VM subsystem. This was merged back into the BSD family in the early ’90s sometime and leads to some interesting terminology in the VM / pmap layer.
It used to hit the X server and then init, but I think these are now explicitly excluded.
AFAIR there is now a more generic mechanism for this, deferring the decision to some other process: in /proc/pid/oom_adj you set an OOM version of the nice level. Then you have https://engineering.fb.com/2018/07/19/production-engineering/oomd/
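A sketch of the per-process knob, using the newer oom_score_adj spelling (oom_adj is the deprecated predecessor): positive values make the process a preferred OOM-killer victim, while -1000 exempts it entirely and requires privilege.

    /* Volunteer this process as a preferred OOM-killer victim. */
    #include <stdio.h>

    int main(void) {
        FILE *f = fopen("/proc/self/oom_score_adj", "w");
        if (f == NULL) { perror("fopen"); return 1; }
        fprintf(f, "500\n");   /* range is -1000 (never kill) to 1000 (kill first) */
        return fclose(f) == 0 ? 0 : 1;
    }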
In the bin of ‘experiments that turned into things I use fairly often but come with sharp corners’ is binding it to some more testable forms in hopes of added synergy. The scripting API in Arcan has ‘target_store’ / ‘target_restore’ calls with a corresponding estimation event from the client end (STATESIZE). The tactic is that the client is supposed to serialize/deserialize on an inbound descriptor with enough state to best-effort store/restore itself. It was tested against emulators (as they tend to naturally have a well-defined notion of this) to implement features like rewind, or for running tool-assisted speedrun recordings as ‘free’ large sets of timing-sensitive test cases.
Then it’s used for network migration and for persisting your config / state on a server (death to dotfiles!). This is how I have my regular desktop built as a read-only bootable image that pulls down the WM, its config, and then launch targets from the closely monitored server. There are some optimizations left to get it down to being about as fast as a suspend/resume cycle, so that I can ignore that frail and leaky approach entirely.
Another emergent case is dragging and dropping state/config between TUI clients (including the Lash shells like Cat9), and recursively for when Arcan applications are composing each other. In the latter case that effectively results in a GC flush and a rebuild of the Lua VM memory space. The missing link is to wire the layers together to let the governing privsep process (which also provides watchdog-like application-not-responding detection) act as an oomd: fake ‘crash recovery’ and have the WM application prioritize and store/restore or reset clients based on the metadata available to it.
The security angle on top of everything: with ‘diskless malware’ (or the term from Phrack which I prefer, process parasites), you can then establish a baseline of what is supposed to be in a process and compare it against forensic snapshots to find what is accidentally there. This is a tall order when it comes to whole-system virtualization in virtual machine and browser cases, but perhaps one day, with hardware assist, every byte will be attributable and explainable.
I think the other fairly well-known system that does not overcommit is Solaris, though my knowledge is a couple of decades out of date!
The last time I used Solaris properly was Solaris 8, and it was still using a brk-based malloc that never returned memory to the kernel. The real physical memory usage of a process was therefore determined by the peak number of allocated objects. I believe Solaris 10 had an mmap-based allocator, but I’m not sure.
I mostly used Solaris 8 because it still shipped with an X server with the xdps extension and I wanted to try GNUstep with it.
memory allocation failures are really hard to handle gracefully
I can attest to this. I know I’ve said so before, and it’s one of those old-timers’ “back when I was a kid…” things, but it bears repeating. I spent 12 years coding for the “classic” MacOS, where running out of memory was an expected situation and any shippable code damn well had to survive it without crashing, or users would be furious.
This was incredibly hard to do, because there are so very many code paths that can fail and it’s infeasible to test them all. Also, you have to ensure that the error handlers / cleanup code don’t allocate memory and trigger a recursive error…
And keep in mind that a modern program probably does at least two orders of magnitude more allocation than a Classic MacOS app did. A problem that was hard back then is even harder now.
Linux is not unique in allowing malloc to succeed but then killing stuff when you try to use the returned memory; macOS does the same thing. This is due to compressed swap: you’re allowed to malloc as long as there is enough space for more swap. If you actually put data into your allocations, they won’t compress as well, and suddenly you might not have enough space anymore.
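A quick, hedged way to see this effect (watch the memory compressor in Activity Monitor, or zswap/zram statistics, while it runs): the same number of dirtied pages costs far more when the contents do not compress.

    /* Dirty ~1 GiB with either zeros (compresses well) or random bytes
     * (compresses poorly); pass any argument to get the random fill. */
    #include <stdio.h>
    #include <stdlib.h>

    #define BYTES (1UL << 30)   /* 1 GiB */

    int main(int argc, char **argv) {
        (void)argv;
        int incompressible = argc > 1;
        unsigned char *buf = malloc(BYTES);
        if (buf == NULL) { perror("malloc"); return 1; }
        for (size_t i = 0; i < BYTES; i++)
            buf[i] = incompressible ? (unsigned char)rand() : 0;
        puts(incompressible ? "random fill: compresses poorly"
                            : "zero fill: compresses well");
        getchar();               /* hold the memory while you observe */
        free(buf);
        return 0;
    }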