Imagine that you go to the bank and ask for a $100,000 mortgage. The nice guy in the bank agrees to lend you the money, and since you need to pay it out in 5 installments, you take $15,000 to the contractor and leave the rest in the bank until it is needed. The bank is doing brisk business, and promises a lot of customers that they can get their mortgage from the bank. Since most of the mortgages are also taken in installments, the bank never actually has enough money to hand over to all of its borrowers. But it makes do.
Until one dark day when you come to the bank and ask for the rest of the money, because it is time to install the kitchen cabinets, and you need to pay for that. The nice man in the bank tells you to wait a bit, and goes to see if they have any money. At this point, it would be embarrassing to tell you that they don’t have any money to give you, because they overcommitted themselves. So the nice man from the bank murders you and buries your body in the desert, to avoid you complaining that you didn’t get the money that you were promised. Actually, the nice man might go ahead and kill someone else (robbing them in the process), and then give you their money. You go home happy to your blood-stained kitchen cabinets.
That is how memory management works in Linux.
After this dramatic opening, let us get down to what is really going on. Linux has a major problem. Its process model means that it is stuck up a tree and the only way down is via free fall. Whenever a process wants to create another process, the standard method in Linux is to call fork() and then call execv() to execute the new binary. The problem here is what fork() does. It needs to copy the entire process state to the new process. That includes all memory, handles, registers, etc.
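To make the fork() then execv() dance concrete, here is a minimal sketch in Python, whose os module is a thin wrapper over the same system calls. The spawn() helper is a name I made up for illustration, and this assumes a Unix-like system where os.fork() is available.

```python
import os
import sys


def spawn(argv):
    """Run argv[0] as a child process via fork() + execv(); return its exit code."""
    pid = os.fork()  # duplicate the current process
    if pid == 0:
        # Child: replace the copied process image with the new binary.
        try:
            os.execv(argv[0], argv)
        finally:
            # Only reached if execv() failed; leave without running parent cleanup.
            os._exit(127)
    # Parent: wait for the child and decode how it ended.
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -1  # the child was killed by a signal


if __name__ == "__main__":
    # sys.executable is the absolute path of the running interpreter.
    spawn([sys.executable, "-c", "print('hello from the new binary')"])
```

Note that everything fork() copied is thrown away the moment execv() succeeds, which is exactly why the copy has to be cheap.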
Let us assume that we have a process that allocated 1GB of memory for reading and writing, and then called fork(). The way things are set up, it is pretty cheap to create the new process, all we need to do is duplicate the kernel data structures and we are done. However, what happens to the memory that the process allocated? The fork() call requires that both processes will have access to that memory, and also that both of them may modify it. That means that we have a copy on write situation. Whenever one of the processes modifies the memory, it is forcing the OS to copy that piece of memory to another physical memory location and remap the virtual addresses.
This allows the developer to do some really cool stuff. Redis implemented its backup strategy via the fork() call. By forking and then dumping the in-memory process state to disk, it can get a consistent snapshot of the system with almost no code. It is the OS that is responsible for maintaining that invariant.
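The visible half of that contract is easy to check: after fork(), a write in one process must not show up in the other. The sketch below (child_write_is_private() is a hypothetical name) only demonstrates those semantics. The actual page copying happens inside the kernel and is not observable from this code, and Python's own bookkeeping touches pages on its own, so do not read it as a measurement of copy on write.

```python
import os


def child_write_is_private(size=1024 * 1024):
    """Return True if a write made by the forked child is invisible to the parent."""
    buf = bytearray(b"a" * size)  # memory allocated before fork()
    read_end, write_end = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: modifying the buffer makes the kernel give us a private copy
        # of the touched page; the parent's page is left alone.
        os.close(read_end)
        buf[0] = ord("b")
        os.write(write_end, bytes(buf[:1]))  # report what the child now sees
        os._exit(0)
    # Parent: collect the child's view and compare it with our own.
    os.close(write_end)
    seen_by_child = os.read(read_end, 1)
    os.close(read_end)
    os.waitpid(pid, 0)
    return seen_by_child == b"b" and buf[0] == ord("a")


if __name__ == "__main__":
    print(child_write_is_private())
```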
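Here is roughly what that trick looks like, as a sketch. To be clear, this is not Redis's code: snapshot() and demo() are names I invented, and JSON stands in for whatever dump format a real system would use. The point is only that the child sees the state as it was at the moment of fork(), no matter what the parent does afterwards.

```python
import json
import os
import tempfile


def snapshot(state, path):
    """Fork a child that writes `state` as it was at fork() time; return the child's pid."""
    pid = os.fork()
    if pid == 0:
        status = 1
        try:
            with open(path, "w") as f:
                json.dump(state, f)  # the child's frozen view of the data
            status = 0
        finally:
            os._exit(status)  # never fall back into the parent's code path
    return pid


def demo():
    state = {"counter": 1}
    path = os.path.join(tempfile.mkdtemp(), "dump.json")
    pid = snapshot(state, path)
    state["counter"] = 2  # the parent keeps mutating while the child writes
    os.waitpid(pid, 0)
    with open(path) as f:
        return json.load(f), state


if __name__ == "__main__":
    saved, live = demo()
    print(saved, live)
```

The file on disk holds the value from before the mutation, while the parent carries on with the new one, and nobody had to write any locking code to get there.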
Speaking of invariants, it also means that there is absolutely no way that Linux can manage memory properly. If we have 2 GB of RAM on the machine, and we have a 1GB process that fork()-ed, what is going to happen? Well, it was promised 1 GB of RAM, and it got that. And it was also promised by fork() that both processes will be able to modify the full 1GB of RAM. If we also have some other processes taking memory (and assuming no swap for the moment), that pretty much means that someone is going to end up holding the dirty end of the stick.
Now, Linux has a configuration option that would prevent this (vm.overcommit_memory = 2, and the overcommit ratio, but that isn’t really important. I’m including this here for the nitpickers, and yes, I’m aware that you can set oom_adj = -17 to protect myself from this issue, not the point). This tells Linux that it shouldn’t overcommit. In such cases, it would mean that the fork() call would fail, and you’ll be left with an effectively crippled system. So, we have the potential for a broken invariant. What is going to happen now?
Well, Linux promised you memory, and after exhausting all of the physical memory, it will start paging to the swap file. But that can be exhausted too. That is when the Out Of Memory Killer gets to play, and it takes an axe and starts choosing a likely candidate to be mercilessly murdered. The “nice” thing about this is that there is no control over that, and you might be a perfectly well behaved process that the OOM killer just doesn’t like this Monday, so buh-bye!
Looking around, it seems that we aren’t the only ones who have run head first into this issue. The Oracle recommendation is to set things up to panic and reboot the entire machine when this happens, and that seems… unproductive.
The problem is that as a database, we aren’t really in control of how much we allocate, and we rely on the system to tell us when we do too much. Linux has no facility to do things like warn applications that memory is low, or even let us know by refusing to allocate more memory. Both are things that we already support, and would be very helpful.
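For the nitpickers who do want to look at that setting, here is a small sketch that reads the current values and estimates the limit strict mode would enforce. The helper names are mine. The formula (swap plus a percentage of RAM) is the simplified form of the kernel's accounting and assumes no huge pages and that vm.overcommit_kbytes is not in use.

```python
from pathlib import Path


def read_vm_setting(name, root="/proc/sys/vm"):
    """Return the integer value of a vm.* sysctl, or None if it cannot be read."""
    try:
        return int(Path(root, name).read_text().strip())
    except (OSError, ValueError):
        return None


def commit_limit_kb(ram_kb, swap_kb, overcommit_ratio):
    """Approximate commit limit under vm.overcommit_memory = 2 (huge pages ignored)."""
    return swap_kb + ram_kb * overcommit_ratio // 100


if __name__ == "__main__":
    print("overcommit_memory:", read_vm_setting("overcommit_memory"))
    print("overcommit_ratio:", read_vm_setting("overcommit_ratio"))
    # The machine from the example above: 2GB of RAM, no swap, a ratio of 50.
    print("commit limit (kB):", commit_limit_kb(2 * 1024 * 1024, 0, 50))
```

With those numbers the limit comes out at 1GB, so a 1GB process calling fork() asks for twice what the system is willing to promise, and the call is refused.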
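If you want to know how much the OOM killer currently dislikes you, the kernel does expose that under /proc. A minimal sketch, assuming a Linux /proc filesystem (oom_info() is a made-up helper name): oom_score is the kernel's current badness estimate for the process, and oom_score_adj is the newer replacement for the oom_adj knob, ranging from -1000 to 1000.

```python
from pathlib import Path


def oom_info(pid="self"):
    """Return the OOM killer's view of a process; values are None if unavailable."""
    info = {}
    for name in ("oom_score", "oom_score_adj"):
        try:
            info[name] = int(Path("/proc", str(pid), name).read_text().strip())
        except (OSError, ValueError):
            info[name] = None  # not Linux, or the process is gone
    return info


if __name__ == "__main__":
    print(oom_info())
```

A higher oom_score means a more likely victim. Knowing your score is nice, but it still doesn't tell you when the axe is about to fall.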
That is quite annoying.