Kernel development

Brief items

Kernel release status

The current development kernel is 3.19-rc1, released on December 20 — one day earlier than might have been expected.

Stable updates: none have been released in the last week.

Comments (none posted)

Live kernel patching on track for 3.20

Uptime-sensitive users have long been interested in the ability to apply patches to a running kernel without having to reboot the system. There are several out-of-tree implementations of this feature out there; it has always been clear that they could not all get into the mainline. According to a recent request from Jiri Kosina to have a live-patching tree added to linux-next, it seems that some progress has been made on this front. In particular, the developers of kpatch and kGraft have been working together:

The core functionality (which is self-contained) now works and has been Reviewed/Acked by both interested parties (i.e. people working on kPatch and kGraft) and agreed to be a common ground on which further development will happen.

The current plan is to try to get this common core pulled during the 3.20 merge window. Then, perhaps, we'll finally have live-patching capability in the mainline kernel.

Full Story (comments: 16)

Kernel development news

The end of the 3.19 merge window

By Jonathan Corbet
December 22, 2014

By the usual practice, the 3.19 merge window should have ended on December 21, two weeks after its opening. But the solstice celebration featuring 3.19-rc1 was not to be; Linus decided to close the merge window one day early. Among other things, his reasoning was that so much work had come in that there wasn't a whole lot of point in waiting for more:

Considering how much came in fairly late, I find it hard to care about anybody who had decided to cut it even closer than some people already did. That said, maybe there aren't any real stragglers - and judging by the size of rc1, there really can't have been much.

Indeed, this is a busy development cycle, with 11,408 changesets having been pulled during the merge window. That is more than was seen during the entire 3.18 development cycle, where 11,379 changesets were merged before the final release.

About 1000 changesets were pulled since last week's summary; some of the more interesting, user-visible changes in that set were:

The handling of the setgroups() system call in user namespaces has been changed in a way that could possibly break some applications; see this article for more information.
The Ceph filesystem now supports inline data, improving performance for small files. Ceph also supports message signing for authentication between clients and servers.
KVM virtualization support for the Itanium (ia64) architecture has been removed. It was not being maintained, and, seemingly, was not being used either.
The InfiniBand layer has gained support for on-demand paging. This feature allows an RDMA region to be set up and populated via page faults when the memory is actually used, thus avoiding the need to pin down a bunch of memory that may never be needed.
New hardware support includes:
- Input: Elan I2C/SMbus trackpads, Goodix I2C touchscreens, and Elan eKTH I2C touchscreens.
- Miscellaneous: Allwinner SoC-based NAND flash, Broadcom BCM2835 pulse-width modulator (PWM) controllers, and Atmel HLCDC PWM controllers.
- Thermal: Support for generic device cooling through clock frequency tweaking, NVIDIA Tegra SOCTHERM thermal management systems, and Rockchip thermal management systems.

Changes visible to kernel developers include:

The module removal code has been reworked to get rid of calls to the much-maligned stop_machine() function. In the process, module reference counting has been slowed slightly, but it is believed that nobody is tweaking reference counts often enough that the difference will be noticed.
The CONFIG_PM_RUNTIME configuration symbol has finally been eliminated from the kernel; everything uses CONFIG_PM now.
The READ_ONCE() and ASSIGN_ONCE() macros (described in this article) were merged in the final pull before the closing of the merge window. These macros enforce the use of ~~non-~~ scalar types, hopefully avoiding experiences with tricky compiler bugs.

If the usual schedule holds, the final 3.19 release can be expected sometime around the middle of March. Between now and then, though, a lot of testing and fixing has to be done — 11,400 new changes will certainly have brought a few bugs with them. That said, the early indications are that 3.19-rc1 is relatively stable for such an early release, so this may yet turn out to be another fast cycle despite the volume of patches.

Comments (4 posted)

CoreOS looks to move from Btrfs to overlayfs

By Jake Edge
December 24, 2014

After many years of different union filesystem implementations trying to get into the mainline, the overlay filesystem (also known as overlayfs) was finally merged for 3.18. It didn't take all that long for at least one project to notice and react to that addition. CoreOS, which is a Linux distribution for large server deployments, is now planning to move its root filesystem images from Btrfs to ext4—with overlayfs on top.

Various filesystem features are used by Docker (which is the mechanism used to run applications on CoreOS—at least for now) to put read-write filesystems atop the read-only base that provides the root filesystem. In addition, Docker applications each have their own read-only filesystem layer that is currently typically handled in CoreOS by using Btrfs's copy-on-write and snapshot features. Moving to a union filesystem like overlayfs will provide the same basic functionality, just using different underlying techniques.

Brandon Phillips proposed the switch to the coreos-dev mailing list on December 15 and the reaction has generally been quite positive. Btrfs is, it seems, still a bit immature. As Phillips noted: "CoreOS users have regularly reported bugs against btrfs including: out of disk space errors, metadata rebalancing problems requiring manual intervention and generally slow performance when compared to other filesystems".

That proposal was greeted by responses from several others who had seen the problems that Phillips mentioned. Seán C. McCord pointed out that he is a Btrfs proponent, but would still be happier using ext4 and overlayfs:

The out-of-space / metadata balancing problem has bitten me more times than I care to count. It's essentially a fact of life that I have to blow away /var/lib/docker and all its subvolumes every few weeks on any given machine, to clear an out-of-space problem (though `df` shows a usage of, say, 30%).

But, in the only real opposition seen in the thread, Morgaine Fowle noted his excitement about the features that Btrfs brings to the table and thinks CoreOS should be focusing on those, rather than what he sees as a cobbled-together solution using overlayfs. Furthermore:

I deeply enjoy the file-system taking responsibility for snapshotting. It creates a consistent management interface that's useful for a wide range of tasks. Anything based off overlayfs is going to have to concoct it's own unique management layer which will require it's own domain knowledge to handle, where-as someone proficient with the filesystem's snapshotting tools is going to have a more general, portable knowledge they'll be able to use to make sense of what CoreOS is doing naturally.

But, according to Phillips's proposal, overlayfs will bring some benefits beyond just more stability. He pointed to a Red Hat evaluation of storage options for Docker that showed overlayfs as a faster choice for creating and destroying containers. In addition, it also said that overlayfs uses memory more efficiently since it can keep a single copy of a file's pages in the page cache, which can then be used by multiple containers. Since there tends to be a lot of overlap between containers, this can result in significant performance improvements. There are some downsides to overlayfs, too, of course, including that changes to files in the underlying read-only layer requires a potentially costly copy-up operation.

Btrfs creator Chris Mason also posted to the thread. He noted that a number of the problems ("warts") that CoreOS users were running into are being addressed:

The 3.19 merge window fixes some very hard to find corruption problems that we've been chasing down, and Josef Bacik has developed a slick power-fail testing target that makes it much easier to prevent similar bugs in the future. 3.19 will also fix rare corruptions with block group removal, making both balance and the new auto-blockgroup cleanup feature much more reliable.

Overall, though, Mason was not particularly disappointed or unhappy about the proposal to switch to overlayfs, saying that CoreOS should choose the storage solution that best fits its needs. He was also glad to see projects looking to adopt overlayfs now that it has been added to the kernel. Similarly, Greg Kroah-Hartman congratulated CoreOS for using overlayfs in a post to Google+.

The main change outlined by Phillips would be to move the root filesystem images from Btrfs to ext4. Eventually, the Docker overlayfs graph backend would be made the default, but existing Btrfs-based CoreOS systems would continue to work as they are. Given that there were almost no complaints about the proposal, with multiple posts agreeing (as well as quite a few "+1" posts), it would appear to be the path forward for CoreOS.

It should be noted that overlayfs itself has only been in the kernel for a short time. The patches been around for quite a while now, and have been used by various distributions along the way, but it probably still has a few bugs that will need to be shaken out. It is far less complex than Btrfs, however, which presumably reduces the risks of switching from one immature storage technology to another. At this point, openSUSE is the only major distribution to have adopted Btrfs as its default filesystem, though others have discussed it.

One conclusion seems inevitable, though: even after many years of development, Btrfs has not reached a level of functionality, performance, and stability required by some. Mason's message provides some hope that we are getting there, but that has seemingly been true for a while now. When we actually get "there" is still anyone's guess at this point.

Comments (5 posted)

The "too small to fail" memory-allocation rule

By Jonathan Corbet
December 23, 2014

Kernel developers have long been told that, with few exceptions, attempts to allocate memory can fail if the system does not have sufficient resources. As a result, in well-written code, every call to a function like kmalloc(), vmalloc(), or __get_free_pages() is accompanied by carefully thought-out error-handling code. It turns out, though, the behavior actually implemented in the memory-management subsystem is a bit different from what is written in the brochure. That difference can lead to unfortunate run-time behavior, but the fix might just be worse.

A discussion on the topic began when Tetsuo Handa posted a question on how to handle a particular problem that had come up. The sequence of events was something like this:

A process that is currently using relatively little memory invokes an XFS filesystem operation that, in turn, needs to perform an allocation to proceed.
The memory management subsystem tries to satisfy the allocation, but finds that there is no memory available. It responds by first trying direct reclaim (forcing pages out of memory to free them), then, if that doesn't produce the needed free memory, it falls back to the out-of-memory (OOM) killer.
The OOM killer picks its victim and attempts to kill it.
To be able to exit, the victim must perform some operations on the same XFS filesystem. That involves acquiring locks that, as it happens, the process attempting to perform the problematic memory allocation is currently holding. Everything comes to a halt.

In other words, the allocating process cannot proceed because it is waiting for its allocation call to return. That call cannot return until memory is freed, which requires the victim process to exit. The OOM killer will also wait for the victim to exit before (possibly) choosing a second process to kill. But the victim process cannot exit because it needs locks held by the allocating process. The system locks up and the owner of the system starts to seriously consider a switch to some version of BSD.

When asked about this problem, XFS maintainer Dave Chinner quickly wondered why the memory-management code was resorting to the OOM killer rather than just failing the problematic memory allocation. The XFS code, he said, is nicely prepared to deal with an allocation failure; to him, using that code seems better than killing random processes and locking up the system as a whole. That is when memory management maintainer Michal Hocko dropped a bomb by saying:

Well, it has been an unwritten rule that GFP_KERNEL allocations for low-order (<=PAGE_ALLOC_COSTLY_ORDER) never fail. This is a long ago decision which would be tricky to fix now without silently breaking a lot of code. Sad...

The resulting explosion could be heard in Dave's incredulous reply:

We have *always* been told memory allocations are not guaranteed to succeed, ever, unless __GFP_NOFAIL is set, but that's deprecated and nobody is allowed to use it any more.

Lots of code has dependencies on memory allocation making progress or failing for the system to work in low memory situations. The page cache is one of them, which means all filesystems have that dependency. We don't explicitly ask memory allocations to fail, we *expect* the memory allocation failures will occur in low memory conditions. We've been designing and writing code with this in mind for the past 15 years.

A "too small to fail" allocation is, in most kernels, one of eight contiguous pages or less — relatively big, in other words. Nobody really knows when the rule that these allocations could not fail went into the kernel; it predates the Git era. As Johannes Weiner explained, the idea was that, if such small allocations could not be satisfied, the system was going to be so unusable that there was no practical alternative to invoking the OOM killer. That may be the case, but locking up the system in a situation where the kernel is prepared to cope with an allocation failure also leads to a situation where things are unusable.

One alternative that was mentioned in the discussion was to add the __GFP_NORETRY flag to specific allocation requests. That flag causes even small allocation requests to fail if the resources are not available. But, as Dave noted, trying to fix potentially deadlocking requests with __GFP_NORETRY is a game of Whack-A-Mole; there are always more moles, and they tend to win in the end.

The alternative would be to get rid of the "too small to fail" rule and make the allocation functions work the way most kernel developers expect them to. Johannes's message included a patch moving things in that direction; it causes the endless reclaim loop to exit (and fail an allocation request) if attempts at direct reclaim do not succeed in actually freeing any memory. But, as he put it, "the thought of failing order-0 allocations after such a long time is scary."

It is scary for a couple of reasons. One is that not all kernel developers are diligent about checking every memory allocation and thinking about a proper recovery path. But it is worse than that: since small allocations do not fail, almost none of the thousands of error-recovery paths in the kernel now are ever exercised. They could be tested if developers were to make use of the the kernel's fault injection framework, but, in practice, it seems that few developers do so. So those error-recovery paths are not just unused and subject to bit rot; chances are that a discouragingly large portion of them have never been tested in the first place.

If the unwritten "too small to fail" rule were to be repealed, all of those error-recovery paths would become live code for the first time. In a sense, the kernel would gain thousands of lines of untested code that only run in rare circumstances where things are already going wrong. There can be no doubt that a number of obscure bugs and potential security problems would result.

That leaves memory-management developers in a bit of a bind. Causing memory allocation functions to behave as advertised seems certain to introduce difficult-to-debug problems into the kernel. But the status quo has downsides of its own, and they could get worse as kernel locking becomes more complicated. It also wastes the considerable development time that goes toward the creation of error-recovery code that will never be executed. Even so, introducing low-order memory-allocation failures at this late date may well prove too scary to be attempted, even if the long-term result would be a better kernel.

Comments (195 posted)

Patches and updates

Kernel trees

Linus Torvalds Linux 3.19-rc1 - merge window closed ?

Luis Henriques Linux 3.16.7-ckt3 ?

Kamal Mostafa Linux 3.13.11-ckt13 ?

Architecture-specific

Baruch Siach ARM: Conexant Digicolor CX92755 SoC support ?

Wang Long ARM: hisi: enable HiP01 SoC ?

Core kernel code

Khalid Aziz sched/fair: Add advisory flag for borrowing a timeslice ?

Device drivers

Jonathan Richardson Add support for Broadcom iProc touchscreen ?

[email protected] Add Skyworks SKY81452 device drivers ?

Pavel Machek bluetooth: Add hci_h4p driver ?

Liu Ying Add support for i.MX MIPI DSI DRM driver ?

Adam Thomson Add initial support for DA9150 Charger & Fuel-Gauge IC ?

Chanwoo Choi devfreq: Add generic exynos memory-bus frequency driver ?

[email protected] drivers/gpio: Altera soft IP GPIO driver ?

Kenneth Westfield ASoC: QCOM: Add support for ipq806x SOC ?

Device driver infrastructure

[email protected] FPGA Manager Framework ?

Rob Clark Atomic Properties (v2) ?

Chanwoo Choi [PATCHv5 0/9] devfreq: Add devfreq-event class to provide raw data for devfreq device ?

Kamil Debski HDMI-CEC framework ?

Documentation

Jonathan Corbet Docs: Modernize SubmittingPatches ?

Nicholas Mc Guire completion documentation proposal ?

Filesystems and block I/O

Dongsu Park simplify block layer based on immutable biovecs ?

Jan Kara quota: Unify VFS and XFS quota interfaces ?

Memory management

[email protected] HMM (Heterogeneous Memory Management) v7 ?

Kirill A. Shutemov mm: remove non-linear mess ?

Miscellaneous

Namhyung Kim perf tools: Speed-up perf report by using multi thread (v1) ?

Page editor: Jonathan Corbet
Next page: Distributions>>