Device namespaces

By Jake Edge
August 28, 2013

Lightweight virtualization using containers is a technique that has finally come together for Linux, though there are still some rough edges that may need filing down. Containers are built on two separate kernel features: control groups (cgroups) and namespaces. Cgroups are in the process of being revamped, and more namespace types may still be needed beyond those currently available. For example, there is no way to separate most devices into their own namespace. That's a hole that Oren Laadan would like to see filled, so he recently posted an RFC on device namespaces.

Namespaces partition global system resources so that different sets of processes have their own view of those resources. For example, mount namespaces partition the mounted filesystems into different views, with the result that processes in one namespace are unable to see or interact with filesystems that are only mounted in another. Similarly, PID namespaces give each namespace its own set of process IDs (PIDs). Certain devices are currently handled by their own namespace or similar functionality: network namespaces for network devices and the devpts pseudo-filesystem for pseudo-terminals (i.e. pty). But there is no way to partition the view of all devices in the system, which is what device namespaces would do.

The motivation for the feature is to allow multiple virtual phones on a single physical phone. For example, one could have two complete Android systems running on the phone, with one slated for work purposes, and the other for personal uses. Each system would run in its own container that would be isolated from the other. That isolation would allow companies to control the apps and security of the "company half" of a phone, while allowing the user to keep their personal information separate. A video gives an overview of the idea. Much of that separation can be done today, but there is a missing piece: virtualizing the devices (e.g. frame buffer, touchscreen, buttons).

The proposal adds the concept of an "active" device namespace, which is the one that the user is currently interacting with. The upshot is that a user could switch between the phone personalities (or personas) as easily as they switch between apps today. Each personality would have access to all of the capabilities of the phone while its namespace was the active one, but none while it was an inactive (or background) namespace.

Setting up a device namespace is done in the normal way, using the clone(), setns(), or unshare() system calls. One surprise is that no new CLONE_* flag is added for device namespaces; instead, the CLONE_NEWPID flag is overloaded. A comment in the code explains why:

    /*
     * Couple device namespace semantics with pid-namespace.
     * It's convenient, and we ran out of clone flags anyway.
     */

While coupling PID and device namespaces may work, it does seem like some kind of long-term solution to the clone flag problem is required. Once a process has been put into a device namespace, any open() of a namespace-aware device will restrict that device to the namespace.
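From user space, that coupling means a device namespace would be requested with the same call that creates a PID namespace today. What follows is a minimal sketch, not code from the patch set: the function name run_in_new_pid_namespace() is made up for illustration, and the claim that a device namespace is created alongside the PID namespace only holds on a kernel carrying the RFC patches (a mainline kernel creates just the PID namespace). The call needs CAP_SYS_ADMIN, so an unprivileged run is expected to fail with EPERM.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Create a new PID namespace with unshare() and run one child in it.
 * Assumption: on a kernel with the RFC patches applied, the same flag
 * would also create a new device namespace for that child.
 *
 * Returns 0 if the child saw itself as PID 1, or a negative errno value
 * (typically -EPERM when the caller lacks CAP_SYS_ADMIN).
 */
int run_in_new_pid_namespace(void)
{
    if (unshare(CLONE_NEWPID) == -1)
        return -errno;

    /* The caller stays in its old PID namespace; only children move. */
    pid_t child = fork();
    if (child == -1)
        return -errno;

    if (child == 0) {
        /* The first process in a new PID namespace is PID 1 there. */
        _exit(getpid() == 1 ? 0 : 1);
    }

    /* Only the parent reaches this point. */
    assert(child > 0);

    int status;
    if (waitpid(child, &status, 0) == -1)
        return -errno;

    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -EIO;
}
```

Note that once the child (PID 1 of the new namespace) exits, further fork() calls from the same parent will fail, so a real container launcher would keep that first child alive as the container's init process.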

At some level, adding device namespaces is simply a matter of virtualizing the major and minor device numbers so that each namespace has its own set of them. The major/minor numbers in a namespace would correspond to the driver loaded for that namespace. Drivers that might be available to multiple namespaces would need to be changed to be namespace-aware. For some kinds of drivers, for example those without any real state (e.g. for Android, the LED subsystem or the backlight/LCD subsystem), the changes would be minimal, essentially just a test: if the namespace that contains the device is the active one, proceed; otherwise, ignore any requested changes.
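The "just a test" case for a stateless driver can be modeled in a few lines of user-space C. This is only an illustrative sketch: dev_ns_model, backlight_model, dev_ns_set_active(), and backlight_set() are hypothetical names invented here, not structures or functions from the RFC patches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a device namespace; not the RFC's real struct. */
struct dev_ns_model {
    int id;
};

/* Hypothetical record of the namespace the user is currently interacting with. */
static const struct dev_ns_model *active_ns;

void dev_ns_set_active(const struct dev_ns_model *ns)
{
    active_ns = ns;
}

static bool dev_ns_is_active(const struct dev_ns_model *ns)
{
    return ns != NULL && ns == active_ns;
}

/* Hypothetical stateless device: a backlight with one brightness value. */
struct backlight_model {
    int brightness;
};

/*
 * Apply the request only when it comes from the active namespace.
 * Returns true if the value was applied, false if it was ignored.
 */
bool backlight_set(struct backlight_model *bl,
                   const struct dev_ns_model *caller_ns, int value)
{
    assert(bl != NULL);
    if (!dev_ns_is_active(caller_ns))
        return false;           /* background namespace: ignore the request */
    bl->brightness = value;
    return true;
}
```

In this model a background persona's request is simply dropped, which is reasonable only because the device carries no state that would need to be restored after a switch.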

Devices, though, are sometimes stateful. One can't suddenly switch sending frame buffer data mid-stream (or mix two streams) and expect the screen contents to stay coherent. So, drivers and subsystems will need to handle the switching behavior. For example, the framebuffer device should only reflect changes to the screen from the active namespace, but it should buffer changes from the background namespace so that those changes will be reflected in the display after a switch.
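One way to picture the framebuffer behavior described above is a per-namespace shadow buffer that is replayed to the hardware on a switch. The sketch below is a user-space model under that assumption; fb_ns_model, fb_device_model, fb_write(), and fb_switch() are hypothetical names, and the actual patches may well handle buffering differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define FB_SIZE 16   /* tiny "screen" for illustration */

/* Hypothetical per-namespace state: what this namespace believes is on screen. */
struct fb_ns_model {
    unsigned char shadow[FB_SIZE];
};

/* Hypothetical device: the real panel contents plus the foreground namespace. */
struct fb_device_model {
    unsigned char hw[FB_SIZE];
    struct fb_ns_model *active;
};

/*
 * Write pixels on behalf of a namespace. The shadow copy is always updated;
 * the hardware is touched only when the writer is the active namespace.
 * Returns false if the write would fall outside the buffer.
 */
bool fb_write(struct fb_device_model *fb, struct fb_ns_model *ns,
              size_t off, const unsigned char *src, size_t len)
{
    assert(fb != NULL && ns != NULL && src != NULL);
    if (off > FB_SIZE || len > FB_SIZE - off)
        return false;
    memcpy(ns->shadow + off, src, len);
    if (fb->active == ns)
        memcpy(fb->hw + off, src, len);
    return true;
}

/* Make another namespace active and replay its buffered contents. */
void fb_switch(struct fb_device_model *fb, struct fb_ns_model *next)
{
    assert(fb != NULL && next != NULL);
    fb->active = next;
    memcpy(fb->hw, next->shadow, FB_SIZE);
}
```

The cost of this design is one extra copy of the screen per namespace, which is the kind of per-driver trade-off that stateful devices would each have to work out.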

Laadan and his colleagues at Cellrox have put together a set of patches based on the 3.4 kernel for the Android emulator (goldfish). There is also a fairly detailed description of the patches and the changes made for both stateless and stateful devices. An Android-based demo that switches between a running phone and an app that displays a changing color palette has also been created.

So far, there hasn't been much in the way of discussion of the idea on the containers and lxc-devel mailing lists that the RFC was posted to. On one hand, it makes sense to be able to virtualize all of the devices in a system, but on the other, that means there are a lot of drivers that might need to change. There may be some "routing" issues to resolve as well: when the phone rings, which namespace handles it? The existing proof-of-concept API for switching the active namespace would also likely need some work.

While it may be a worthwhile feature, it could also lead to a large ripple effect of driver changes. How device namespaces fare in terms of moving toward the mainline may well hinge on others stepping forward with additional use cases. In the end, though, the core changes to support the feature are fairly small, so the phone personality use case might be enough all on its own.

Device namespaces

Posted Aug 29, 2013 5:18 UTC (Thu) by kugel (subscriber, #70540)

Can there only be one active device namespace at a time? That's probably sufficient for Android but seems to ignore multi-seat systems entirely.

I could easily imagine that, for example, a shared server has N of the same PCI devices (e.g. graphics cards), each of which is allocated to a single user (out of N users) who may be connected at the same time through remote desktop.

Device namespaces

Posted Aug 29, 2013 11:46 UTC (Thu) by amir73il (subscriber, #66165)

The current patch set supports a single active device namespace, which serves the virtual phone use case, but the extension to N active device namespaces would be natural, should there be use cases that require it.

One of the main reasons that we posted our work is to hear from other people on their use cases, so we can cover those cases in our future releases.

Are you aware of any active projects that could make use of N active device namespaces if that code were posted?

Device namespaces

Posted Aug 29, 2013 16:09 UTC (Thu) by JohnLenz (guest, #42089)

This article discusses recent multiseat device plans. Rather than using device namespaces, the plan is to not allow any session to open devices directly. Instead, the devices will be opened by systemd and the FD passed to the session. To me this seems like a much better solution than device namespaces, since the userspace daemon (systemd in this case) can have device-specific knowledge about how to idle one FD and enable another so that the device switches between sessions. With device namespaces, all of this must be put into the kernel for every driver.
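For readers unfamiliar with the mechanism this comment refers to, a privileged daemon can open a device and hand the open file descriptor to another process over a Unix-domain socket. The sketch below shows only the generic SCM_RIGHTS technique; it is not systemd's or logind's code, and send_fd() and recv_fd() are illustrative names chosen here.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one file descriptor. */
union fd_ctrl {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Send one open descriptor over a Unix-domain socket. Returns 0 or -1. */
int send_fd(int sock, int fd)
{
    char byte = 'F';   /* one byte of ordinary data accompanies the descriptor */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_ctrl ctrl;
    struct msghdr msg = {0};

    memset(&ctrl, 0, sizeof(ctrl));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor. Returns the new descriptor, or -1 on failure. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_ctrl ctrl;
    struct msghdr msg = {0};
    int fd;

    memset(&ctrl, 0, sizeof(ctrl));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The receiver gets its own descriptor referring to the same open file, which is what would let a daemon keep control over when that file is usable.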

Device namespaces

Posted Aug 30, 2013 5:40 UTC (Fri) by amir73il (subscriber, #66165)

Very good article!
We see device namespaces as complementary to existing solutions that apply restrictions to device access.

It really depends on the use case, whether restricting device access is sufficient or if a certain device needs to be virtualized.
For example, in the use case discussed in the article, the graphics servers of all sessions need to either cooperate with logind (by acknowledging the switch) or gracefully handle EACCES errors while in an inactive session.