The "too small to fail" memory-allocation rule

Posted Dec 23, 2014 23:15 UTC (Tue) by cesarb (subscriber, #6266)
Parent article: The "too small to fail" memory-allocation rule

> That leaves memory-management developers in a bit of a bind. Causing memory allocation functions to behave as advertised seems certain to introduce difficult-to-debug problems into the kernel. But the status quo has downsides of its own, and they could get worse as kernel locking becomes more complicated.

If I understood it correctly, the problem is a "recursive locking" deadlock: the filesystem does a memory allocation under a lock, which waits for a process to exit, which calls into the filesystem, which needs the same lock again.

Doesn't the kernel already have a way of saying "I'm the filesystem, do not call the filesystem to free memory", that is, GFP_NOFS (and the related GFP_NOIO)? Couldn't the meaning of that flag be extended to also mean "don't wait for the OOM killer even for small allocations", since filesystem and memory management code have a higher probability of having a good quality error recovery code?

The "too small to fail" memory-allocation rule

Posted Dec 24, 2014 3:13 UTC (Wed) by neilbrown (subscriber, #359) [Link]

This looks to me to be a valid and relevant observation.

The allocation is question is __GFP_NOWAIT which is equivalent to GFP_NOIO.
Such an allocation must not risk waiting for FS or IO activity. Yet waiting for the OOM killer can clearly do that. This is a violation of the interface whether you accept the "too small to fail" rule or not.

Invoking the OOM killer is reasonable, waiting a little while is reasonable, but waiting for the killed processes to flush data or close files is not..... Hmmm. The out_of_memory() function chooses a processes, kills it, then waits one second ("schedule_timeout_killable(1);"), so it seems to be behaving as I would expect.

Reading more of the email threads it seems that TIF_MEMDIE is an important part of the problem. This is set on a thread when it has been chosen to die. As long as any thread has this set, "select_bad_process()" will not select anything else so nothing else gets killed.

So the "real" problem seems to be that if a process with TIF_MEMDIE set enters filesystem (or IO) code and blocks, a GFP_NOIO allocation can be made to wait for that filesystem code to complete - which could deadlock.

Looking at the original patch which started this: http://marc.info/?l=linux-mm&m=141839249819519&w=2 the problem seems to involve a process with TIF_MEMDIE set, which has already released its memory, but is now stuck in the filesystem, probably closing some files.

As it has TIF_MEMDIE set, nothing else will be killed. So no memory will become available, so it will stay stuck in the filesystem...

This deadlock could possibly be avoided by having oom_scan_process_thread() not abort the scan if TIF_MEMDIE is set on a process which has already dropped its memory. Then another process can be killed even if the first one blocked

But that analysis was very hasty and probably misses something important. "Here be dragons" is how I would mark this code on a map of Linux.

The "too small to fail" memory-allocation rule

Posted Jan 15, 2015 19:40 UTC (Thu) by ksandstr (guest, #60862) [Link]

Indeed it's bizarre that an automatic attempt to avert malloc failure would potentially re-enter its caller and therefore guarantee a lock-ordering violation. First, the allocator is at all called with a lock held, implying that allocation length follows from state which the lock is used to guard; which in turn implies overly eager design.

Second, that the allocator enters the OOM killer on its own volition, changing malloc's locking behaviour from "only ever the heap lock" to "everything an asynchronous OOM killer might end up with". That might well cover all of the kernel, when the caller is expecting malloc to be atomic by itself.

What's surprising isn't so much the reams of untested malloc-failure handling code, but that this bug comes up as late as 2014.