The "too small to fail" memory-allocation rule
The "too small to fail" memory-allocation rule
Posted Dec 23, 2014 23:15 UTC (Tue) by cesarb (subscriber, #6266)
Parent article: The "too small to fail" memory-allocation rule
If I understood it correctly, the problem is a "recursive locking" deadlock: the filesystem allocates memory while holding a lock; the allocation waits for a killed process to exit; and that process calls back into the filesystem, which needs the same lock again.
Doesn't the kernel already have a way of saying "I'm the filesystem, do not call the filesystem to free memory", that is, GFP_NOFS (and the related GFP_NOIO)? Couldn't the meaning of that flag be extended to also mean "don't wait for the OOM killer even for small allocations", since filesystem and memory-management code have a higher probability of having good-quality error-recovery code?
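A minimal sketch of the deadlock shape being described (fs_write_begin() and fs_mutex are hypothetical names, not real kernel symbols):

    #include <linux/mutex.h>
    #include <linux/slab.h>

    static DEFINE_MUTEX(fs_mutex);

    static int fs_write_begin(void)
    {
            void *buf;

            mutex_lock(&fs_mutex);
            /*
             * A GFP_KERNEL allocation here could enter reclaim and wait
             * for the OOM killer, whose victim may be trying to take
             * fs_mutex on its exit path: deadlock.  GFP_NOFS asks the
             * allocator not to recurse into filesystem code.
             */
            buf = kmalloc(4096, GFP_NOFS);
            if (!buf) {
                    mutex_unlock(&fs_mutex);
                    return -ENOMEM;
            }
            /* ... perform the write ... */
            kfree(buf);
            mutex_unlock(&fs_mutex);
            return 0;
    }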
Posted Dec 24, 2014 3:13 UTC (Wed) by neilbrown (subscriber, #359)
The allocation in question is GFP_NOFS, which is equivalent to GFP_NOIO for this purpose.
Invoking the OOM killer is reasonable, waiting a little while is reasonable, but waiting for the killed processes to flush data or close files is not... Hmmm. The out_of_memory() function chooses a process, kills it, then waits one jiffy ("schedule_timeout_killable(1);"), so it seems to be behaving as I would expect.
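In outline (a paraphrase of the flow just described, not the actual mm/oom_kill.c source; real arguments are elided):

    /* Outline only: pick a victim, kill it, back off briefly. */
    victim = select_bad_process(...);   /* refuses to pick a second victim
                                           while any task has TIF_MEMDIE */
    oom_kill_process(victim, ...);      /* SIGKILL it, set TIF_MEMDIE */
    schedule_timeout_killable(1);       /* sleep for one jiffy, then retry
                                           the allocation */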
Reading more of the email threads, it seems that TIF_MEMDIE is an important part of the problem. This is set on a thread when it has been chosen to die. As long as any thread has this set, "select_bad_process()" will not select anything else, so nothing else gets killed.
So the "real" problem seems to be that if a process with TIF_MEMDIE set enters filesystem (or IO) code and blocks, a GFP_NOIO allocation can be made to wait for that filesystem code to complete - which could deadlock.
Looking at the original patch which started this (http://marc.info/?l=linux-mm&m=141839249819519&w=2), the problem seems to involve a process with TIF_MEMDIE set, which has already released its memory but is now stuck in the filesystem, probably closing some files.
As it has TIF_MEMDIE set, nothing else will be killed. So no memory will become available, so it will stay stuck in the filesystem...
This deadlock could possibly be avoided by having oom_scan_process_thread() not abort the scan if TIF_MEMDIE is set on a process which has already dropped its memory. Then another process could be killed even if the first one has blocked.
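Roughly, in terms of the OOM_SCAN_* values that oom_scan_process_thread() returned at the time (the task->mm test is only a stand-in for a reliable "has already dropped its memory" check):

    /* Sketch only: do not treat an already-exited victim as a reason
     * to stall the whole OOM scan. */
    if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
            if (task->mm)
                    return OOM_SCAN_ABORT;    /* still holds memory: wait */
            return OOM_SCAN_CONTINUE;         /* mm gone: allow another victim */
    }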
But that analysis was very hasty and probably misses something important. "Here be dragons" is how I would mark this code on a map of Linux.
Posted Jan 15, 2015 19:40 UTC (Thu) by ksandstr (guest, #60862)
Second, that the allocator enters the OOM killer of its own volition, changing malloc's locking behaviour from "only ever the heap lock" to "everything an asynchronous OOM killer might end up with". That might well cover all of the kernel, when the caller is expecting malloc to be atomic by itself.
What's surprising isn't so much the reams of untested malloc-failure handling code as that this bug comes up as late as 2014.
The "too small to fail" memory-allocation rule
Such an allocation must not risk waiting for FS or IO activity. Yet waiting for the OOM killer can clearly do that. This is a violation of the interface whether you accept the "too small to fail" rule or not.
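To make the contract concrete, a sketch of a caller on the IO path (illustrative only; io_ctx is a made-up type):

    #include <linux/slab.h>

    struct io_ctx { int dummy; };   /* hypothetical per-request context */

    static int start_io(void)
    {
            struct io_ctx *ctx = kmalloc(sizeof(*ctx), GFP_NOIO);

            /*
             * GFP_NOIO promises that reclaim may free clean memory but
             * will not start or wait on IO.  If the allocator instead
             * blocks until an OOM victim stuck in the IO path exits,
             * that promise is broken.
             */
            if (!ctx)
                    return -ENOMEM;  /* a small NOIO allocation should be
                                        able to fail */
            /* ... submit the IO ... */
            kfree(ctx);
            return 0;
    }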
The "too small to fail" memory-allocation rule