Re: How to handle TIF_MEMDIE stalls?

[Posted December 23, 2014 by corbet]

From:		Michal Hocko <mhocko-AT-suse.cz>
To:		Dave Chinner <david-AT-fromorbit.com>
Subject:		Re: How to handle TIF_MEMDIE stalls?
Date:		Mon, 22 Dec 2014 17:57:36 +0100
Message-ID:		<[email protected]>
Cc:		Tetsuo Handa <penguin-kernel-AT-I-love.SAKURA.ne.jp>, dchinner-AT-redhat.com, linux-mm-AT-kvack.org, rientjes-AT-google.com, oleg-AT-redhat.com

On Mon 22-12-14 07:42:49, Dave Chinner wrote:
[...]
> "memory reclaim gave up"? So why the hell isn't it returning a
> failure to the caller?
> 
> i.e. We have a perfectly good page cache allocation failure error
> path here all the way back to userspace, but we're invoking the
> OOM-killer to kill random processes rather than returning ENOMEM to
> the processes that are generating the memory demand?
> 
> Further: when did the oom-killer become the primary method
> of handling situations when memory allocation needs to fail?
> __GFP_WAIT does *not* mean memory allocation can't fail - that's what
> __GFP_NOFAIL means. And none of the page cache allocations use
> __GFP_NOFAIL, so why aren't we getting an allocation failure before
> the oom-killer is kicked?

Well, it has been an unwritten rule that low-order
(<=PAGE_ALLOC_COSTLY_ORDER) GFP_KERNEL allocations never fail. This
decision was made long ago and would be tricky to fix now without
silently breaking a lot of code. Sad...
Nevertheless, the caller can avoid an endless loop by using
__GFP_NORETRY, so this could be used as a workaround. IMO the default
should be the opposite: only those who really require some guarantee
should use a special flag for that purpose.
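
To make the rule concrete, here is a rough userspace model of the retry
decision described above. It is only a sketch, not the kernel code: the
MODEL_* flag values and the model_should_retry() name are made up for
illustration, and the real allocator takes more conditions into account
(reclaim progress, __GFP_REPEAT, suspend). The only value taken from the
kernel is 3 for PAGE_ALLOC_COSTLY_ORDER.

```c
#include <stdbool.h>

/* Hypothetical flag bits for this model; the kernel's real values differ. */
#define MODEL_GFP_NORETRY 0x1u
#define MODEL_GFP_NOFAIL  0x2u

/* Mirrors the kernel's PAGE_ALLOC_COSTLY_ORDER. */
#define MODEL_COSTLY_ORDER 3u

/*
 * Simplified model of the retry decision. Returns true when the
 * allocator would loop again rather than hand NULL back to the caller.
 */
bool model_should_retry(unsigned int flags, unsigned int order)
{
	/* The caller opted out of looping: fail the allocation. */
	if (flags & MODEL_GFP_NORETRY)
		return false;

	/* The caller explicitly asked for an allocation that cannot fail. */
	if (flags & MODEL_GFP_NOFAIL)
		return true;

	/* The unwritten rule: low orders behave as if NOFAIL were set. */
	return order <= MODEL_COSTLY_ORDER;
}
```

In this model an order-0 request with no flags always retries, an
order-4 request with no flags is allowed to fail, and setting
MODEL_GFP_NORETRY makes even an order-0 request return a failure to
the caller.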

> > I guess __alloc_pages_direct_reclaim() returns NULL with did_some_progress > 0
> > so that __alloc_pages_may_oom() will not be called easily. As long as
> > try_to_free_pages() returns non-zero, __alloc_pages_direct_reclaim() might
> > return NULL with did_some_progress > 0. So, do_try_to_free_pages() is called
> > many times and is likely to return non-zero. And when
> > __alloc_pages_may_oom() is called, TIF_MEMDIE is set on the thread waiting
> > for mutex_lock(&"struct inode"->i_mutex) at xfs_file_buffered_aio_write()
> > and I see no further progress.
> 
> Of course - TIF_MEMDIE doesn't do anything to the task that is
> blocked, and the SIGKILL signal can't be delivered until the syscall
> completes or the kernel code checks for pending signals and handles
> EINTR directly. Mutexes are uninterruptible by design so there's no
> EINTR processing, hence the oom killer cannot make progress when
> everything is blocked on mutexes waiting for memory allocation to
> succeed or fail.
> 
> i.e. until the lock holder exits from direct memory reclaim and
> releases the locks it holds, the oom killer will not be able to save
> the system. IOWs, the problem is that memory allocation is not
> failing when it should....
> 
> Focussing on the OOM killer here is the wrong way to solve this
> problem - the problem that needs to be solved is sane handling of
> OOM conditions to avoid needing to invoke the OOM-killer...

Completely agreed!

[...]
-- 
Michal Hocko
SUSE Labs
