Kernel development
Brief items
Kernel release status
The current development kernel is 3.19-rc1, released on December 20 — one day earlier than might have been expected.

Stable updates: none have been released in the last week.
Live kernel patching on track for 3.20
Uptime-sensitive users have long been interested in the ability to apply patches to a running kernel without having to reboot the system. Several out-of-tree implementations of this feature exist, and it has always been clear that they could not all get into the mainline. According to a recent request from Jiri Kosina to have a live-patching tree added to linux-next, it seems that some progress has been made on this front. In particular, the developers of kpatch and kGraft have been working together:
The current plan is to try to get this common core pulled during the 3.20 merge window. Then, perhaps, we'll finally have live-patching capability in the mainline kernel.
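As a rough sketch of what a patch module built on that common core looks like, the sample posted with the patches takes approximately the following form. The klp_* names are taken from the patches as posted and could still change before merging, and the patched symbol and replacement function here are purely hypothetical:

    #include <linux/module.h>
    #include <linux/kernel.h>
    #include <linux/livepatch.h>

    /* Replacement for a (hypothetical) kernel function "show_something()". */
    static int patched_show_something(void)
    {
        pr_info("live-patched version called\n");
        return 0;
    }

    /* Which functions to replace; the array ends with an empty entry. */
    static struct klp_func funcs[] = {
        {
            .old_name = "show_something",
            .new_func = patched_show_something,
        }, { }
    };

    /* A NULL .name means the function lives in the core kernel (vmlinux). */
    static struct klp_object objs[] = {
        {
            .funcs = funcs,
        }, { }
    };

    static struct klp_patch patch = {
        .mod = THIS_MODULE,
        .objs = objs,
    };

    static int __init livepatch_init(void)
    {
        int ret = klp_register_patch(&patch);

        if (ret)
            return ret;
        return klp_enable_patch(&patch);
    }

    module_init(livepatch_init);
    MODULE_LICENSE("GPL");

Once such a module is loaded, calls to the old function are redirected to the replacement without a reboot; how safely that redirection happens is exactly where kpatch and kGraft have differed in the past.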
Kernel development news
The end of the 3.19 merge window
By the usual practice, the 3.19 merge window should have ended on December 21, two weeks after its opening. But the solstice celebration featuring 3.19-rc1 was not to be; Linus decided to close the merge window one day early. Among other things, his reasoning was that so much work had come in that there wasn't a whole lot of point in waiting for more:
Indeed, this is a busy development cycle, with 11,408 changesets having been pulled during the merge window. That is more than was seen during the entire 3.18 development cycle, where 11,379 changesets were merged before the final release.
About 1000 changesets were pulled since last week's summary; some of the more interesting, user-visible changes in that set were:
- The handling of the setgroups() system call in user namespaces has been changed in a way that could possibly break some applications; see this article for more information. (A short user-space sketch of the new requirement follows this list.)
- The Ceph filesystem now supports inline data, improving performance for small files. Ceph also supports message signing for authentication between clients and servers.
- KVM virtualization support for the Itanium (ia64) architecture has been removed. It was not being maintained and, seemingly, was not being used either.
- The InfiniBand layer has gained support for on-demand paging. This feature allows an RDMA region to be set up and populated via page faults when the memory is actually used, thus avoiding the need to pin down a bunch of memory that may never be needed.
- New hardware support includes:
  - Input: Elan I2C/SMBus trackpads, Goodix I2C touchscreens, and Elan eKTH I2C touchscreens.
  - Miscellaneous: Allwinner SoC-based NAND flash, Broadcom BCM2835 pulse-width modulator (PWM) controllers, and Atmel HLCDC PWM controllers.
  - Thermal: generic device cooling through clock frequency tweaking, NVIDIA Tegra SOCTHERM thermal management systems, and Rockchip thermal management systems.
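Returning to the setgroups() change mentioned above: a minimal user-space sketch of the new requirement might look like the following. The uid/gid value of 1000 is just an example and the error handling is minimal; the point is that an unprivileged process must now write "deny" to /proc/self/setgroups before it may write a gid_map for a user namespace it has created.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sched.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static void write_file(const char *path, const char *buf)
    {
        int fd = open(path, O_WRONLY);

        if (fd < 0 || write(fd, buf, strlen(buf)) < 0)
            perror(path);
        if (fd >= 0)
            close(fd);
    }

    int main(void)
    {
        if (unshare(CLONE_NEWUSER) < 0) {               /* create a user namespace */
            perror("unshare");
            return 1;
        }
        write_file("/proc/self/setgroups", "deny");     /* new requirement in 3.19 */
        write_file("/proc/self/uid_map", "0 1000 1");   /* map uid 1000 to root inside */
        write_file("/proc/self/gid_map", "0 1000 1");   /* permitted after the deny */
        execlp("id", "id", (char *)NULL);               /* show the mapped IDs */
        perror("execlp");
        return 1;
    }

Applications that wrote gid_map without touching the setgroups file will now see that write fail, which is the compatibility concern noted above.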
Changes visible to kernel developers include:
- The module removal code has been reworked to get rid of calls to the much-maligned stop_machine() function. In the process, module reference counting has been slowed slightly, but it is believed that nobody tweaks reference counts often enough for the difference to be noticed.
- The CONFIG_PM_RUNTIME configuration symbol has finally been eliminated from the kernel; everything uses CONFIG_PM now.
- The READ_ONCE() and ASSIGN_ONCE() macros (described in this article) were merged in the final pull before the closing of the merge window. These macros enforce the use of scalar types, hopefully avoiding unpleasant experiences with tricky compiler bugs.
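As a quick illustration of how the new macros are meant to be used (the structure and field names here are invented for the example):

    #include <linux/compiler.h>

    struct shared_state {
        unsigned long flags;    /* modified concurrently by other CPUs */
    };

    static unsigned long sample_flags(struct shared_state *s)
    {
        /* READ_ONCE() forces a single load that the compiler may
         * neither cache nor repeat. */
        unsigned long f = READ_ONCE(s->flags);

        /* ASSIGN_ONCE(val, x) stores val into x with the same
         * single-access guarantee. */
        ASSIGN_ONCE(f | 1UL, s->flags);

        return f;
    }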
If the usual schedule holds, the final 3.19 release can be expected sometime around the middle of February. Between now and then, though, a lot of testing and fixing has to be done — 11,400 new changes will certainly have brought a few bugs with them. That said, the early indications are that 3.19-rc1 is relatively stable for such an early release, so this may yet turn out to be another fast cycle despite the volume of patches.
CoreOS looks to move from Btrfs to overlayfs
After many years of different union filesystem implementations trying to get into the mainline, the overlay filesystem (also known as overlayfs) was finally merged for 3.18. It didn't take all that long for at least one project to notice and react to that addition. CoreOS, which is a Linux distribution for large server deployments, is now planning to move its root filesystem images from Btrfs to ext4—with overlayfs on top.
Various filesystem features are used by Docker (which is the mechanism used to run applications on CoreOS—at least for now) to put read-write filesystems atop the read-only base that provides the root filesystem. In addition, Docker applications each have their own read-only filesystem layer that is currently handled in CoreOS by using Btrfs's copy-on-write and snapshot features. Moving to a union filesystem like overlayfs will provide the same basic functionality, just using different underlying techniques.
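To make that layering concrete, an overlayfs mount as merged in 3.18 looks roughly like the following minimal sketch; the directory paths are invented for the example:

    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
        /*
         * /lower:  the read-only base (e.g. a container image)
         * /upper:  the per-container writable layer
         * /work:   scratch space overlayfs needs, on the same
         *          filesystem as /upper
         * /merged: the combined view that the container actually sees
         */
        if (mount("overlay", "/merged", "overlay", 0,
                  "lowerdir=/lower,upperdir=/upper,workdir=/work") < 0) {
            perror("mount");
            return 1;
        }
        return 0;
    }

Writes land in the upper layer, while unmodified files are served straight from the shared lower layer, which is where the page-cache sharing discussed below comes from.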
Brandon Phillips proposed the switch to the coreos-dev mailing list on December 15, and the reaction has generally been quite positive. Btrfs is, it seems, still a bit immature. As Phillips noted: "CoreOS users have regularly reported bugs against btrfs including: out of disk space errors, metadata rebalancing problems requiring manual intervention and generally slow performance when compared to other filesystems."
That proposal was greeted by responses from several others who had seen the problems that Phillips mentioned. Seán C. McCord pointed out that he is a Btrfs proponent, but would still be happier using ext4 and overlayfs:
But, in the only real opposition seen in the thread, Morgaine Fowle noted his excitement about the features that Btrfs brings to the table and thinks CoreOS should be focusing on those, rather than what he sees as a cobbled-together solution using overlayfs. Furthermore:
But, according to Phillips's proposal, overlayfs will bring some benefits beyond just more stability. He pointed to a Red Hat evaluation of storage options for Docker that showed overlayfs as a faster choice for creating and destroying containers. It also said that overlayfs uses memory more efficiently, since it can keep a single copy of a file's pages in the page cache, which can then be shared by multiple containers. Since there tends to be a lot of overlap between containers, this can result in significant performance improvements. There are some downsides to overlayfs too, of course, including that changes to files in the underlying read-only layer require a potentially costly copy-up operation.
Btrfs creator Chris Mason also posted to the thread. He noted that a number of the problems ("warts") that CoreOS users were running into are being addressed:
Overall, though, Mason was not particularly disappointed or unhappy about the proposal to switch to overlayfs, saying that CoreOS should choose the storage solution that best fits its needs. He was also glad to see projects looking to adopt overlayfs now that it has been added to the kernel. Similarly, Greg Kroah-Hartman congratulated CoreOS for using overlayfs in a post to Google+.
The main change outlined by Phillips would be to move the root filesystem images from Btrfs to ext4. Eventually, the Docker overlayfs graph backend would be made the default, but existing Btrfs-based CoreOS systems would continue to work as they are. Given that there were almost no complaints about the proposal, with multiple posts agreeing (as well as quite a few "+1" posts), it would appear to be the path forward for CoreOS.
It should be noted that overlayfs itself has only been in the kernel for a short time. The patches have been around for quite a while now, and have been used by various distributions along the way, but it probably still has a few bugs that will need to be shaken out. It is far less complex than Btrfs, however, which presumably reduces the risks of switching from one immature storage technology to another. At this point, openSUSE is the only major distribution to have adopted Btrfs as its default filesystem, though others have discussed it.
One conclusion seems inevitable, though: even after many years of development, Btrfs has not reached a level of functionality, performance, and stability required by some. Mason's message provides some hope that we are getting there, but that has seemingly been true for a while now. When we actually get "there" is still anyone's guess at this point.
The "too small to fail" memory-allocation rule
Kernel developers have long been told that, with few exceptions, attempts to allocate memory can fail if the system does not have sufficient resources. As a result, in well-written code, every call to a function like kmalloc(), vmalloc(), or __get_free_pages() is accompanied by carefully thought-out error-handling code. It turns out, though, that the behavior actually implemented in the memory-management subsystem is a bit different from what is written in the brochure. That difference can lead to unfortunate run-time behavior, but the fix might just be worse.

A discussion on the topic began when Tetsuo Handa posted a question on how to handle a particular problem that had come up. The sequence of events was something like this:
- A process that is currently using relatively little memory invokes an XFS filesystem operation that, in turn, needs to perform an allocation to proceed.
- The memory management subsystem tries to satisfy the allocation, but finds that there is no memory available. It responds by first trying direct reclaim (forcing pages out of memory to free them); then, if that doesn't produce the needed free memory, it falls back to the out-of-memory (OOM) killer.
- The OOM killer picks its victim and attempts to kill it.
- To be able to exit, the victim must perform some operations on the same XFS filesystem. That involves acquiring locks that, as it happens, the process attempting to perform the problematic memory allocation is currently holding. Everything comes to a halt.
In other words, the allocating process cannot proceed because it is waiting for its allocation call to return. That call cannot return until memory is freed, which requires the victim process to exit. The OOM killer will also wait for the victim to exit before (possibly) choosing a second process to kill. But the victim process cannot exit because it needs locks held by the allocating process. The system locks up and the owner of the system starts to seriously consider a switch to some version of BSD.
When asked about this problem, XFS maintainer Dave Chinner quickly wondered why the memory-management code was resorting to the OOM killer rather than just failing the problematic memory allocation. The XFS code, he said, is nicely prepared to deal with an allocation failure; to him, using that code seems better than killing random processes and locking up the system as a whole. That is when memory management maintainer Michal Hocko dropped a bomb by saying:
The resulting explosion could be heard in Dave's incredulous reply:
Lots of code has dependencies on memory allocation making progress or failing for the system to work in low memory situations. The page cache is one of them, which means all filesystems have that dependency. We don't explicitly ask memory allocations to fail, we *expect* the memory allocation failures will occur in low memory conditions. We've been designing and writing code with this in mind for the past 15 years.
A "too small to fail" allocation is, in most kernels, one of eight contiguous pages or less — relatively big, in other words. Nobody really knows when the rule that these allocations could not fail went into the kernel; it predates the Git era. As Johannes Weiner explained, the idea was that, if such small allocations could not be satisfied, the system was going to be so unusable that there was no practical alternative to invoking the OOM killer. That may be the case, but locking up the system in a situation where the kernel is prepared to cope with an allocation failure also leads to a situation where things are unusable.
One alternative that was mentioned in the discussion was to add the __GFP_NORETRY flag to specific allocation requests. That flag causes even small allocation requests to fail if the resources are not available. But, as Dave noted, trying to fix potentially deadlocking requests with __GFP_NORETRY is a game of Whack-A-Mole; there are always more moles, and they tend to win in the end.
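For reference, opting out at a single call site looks something like this minimal sketch; the helper function is invented for illustration:

    #include <linux/slab.h>
    #include <linux/gfp.h>

    static void *alloc_fs_buffer(size_t size)
    {
        /*
         * GFP_NOFS avoids recursing into filesystem reclaim;
         * __GFP_NORETRY tells the allocator to give up and return NULL
         * rather than looping in reclaim or invoking the OOM killer.
         */
        void *buf = kmalloc(size, GFP_NOFS | __GFP_NORETRY);

        if (!buf)
            return NULL;    /* the caller's error path actually runs */
        return buf;
    }

The whack-a-mole nature of the approach comes from the fact that every potentially deadlocking call site has to be found and annotated this way, one by one.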
The alternative would be to get rid of the "too small to fail" rule and make the allocation functions work the way most kernel developers expect them to. Johannes's message included a patch moving things in that direction; it causes the endless reclaim loop to exit (and fail an allocation request) if attempts at direct reclaim do not succeed in actually freeing any memory. But, as he put it, "the thought of failing order-0 allocations after such a long time is scary."
It is scary for a couple of reasons. One is that not all kernel developers are diligent about checking every memory allocation and thinking about a proper recovery path. But it is worse than that: since small allocations do not fail, almost none of the thousands of error-recovery paths in the kernel are ever exercised. They could be tested if developers were to make use of the kernel's fault injection framework but, in practice, it seems that few developers do so. So those error-recovery paths are not just unused and subject to bit rot; chances are that a discouragingly large portion of them have never been tested in the first place.
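For developers who do want to exercise those paths, the failslab fault-injection knobs make it reasonably easy to start failing slab allocations; a minimal user-space sketch (assuming CONFIG_FAILSLAB is enabled and debugfs is mounted at /sys/kernel/debug) might look like this:

    #include <stdio.h>

    static void write_knob(const char *path, const char *value)
    {
        FILE *f = fopen(path, "w");

        if (!f) {
            perror(path);
            return;
        }
        fputs(value, f);
        fclose(f);
    }

    int main(void)
    {
        /* Fail roughly 10% of slab allocations, with no limit on how
         * many failures may be injected, and log each one. */
        write_knob("/sys/kernel/debug/failslab/probability", "10");
        write_knob("/sys/kernel/debug/failslab/times", "-1");
        write_knob("/sys/kernel/debug/failslab/verbose", "1");
        /* Also inject into allocations that are allowed to sleep
         * (GFP_KERNEL and friends). */
        write_knob("/sys/kernel/debug/failslab/ignore-gfp-wait", "0");
        return 0;
    }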
If the unwritten "too small to fail" rule were to be repealed, all of those error-recovery paths would become live code for the first time. In a sense, the kernel would gain thousands of lines of untested code that only run in rare circumstances where things are already going wrong. There can be no doubt that a number of obscure bugs and potential security problems would result.
That leaves memory-management developers in a bit of a bind. Causing memory allocation functions to behave as advertised seems certain to introduce difficult-to-debug problems into the kernel. But the status quo has downsides of its own, and they could get worse as kernel locking becomes more complicated. It also wastes the considerable development time that goes toward the creation of error-recovery code that will never be executed. Even so, introducing low-order memory-allocation failures at this late date may well prove too scary to be attempted, even if the long-term result would be a better kernel.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Device drivers
Device driver infrastructure
Documentation
Filesystems and block I/O
Memory management
Miscellaneous
Page editor: Jonathan Corbet