
LCE: The failure of operating systems and how we can fix it


By Michael Kerrisk
November 14, 2012

The abstract of Glauber Costa's talk at LinuxCon Europe 2012 started with the humorous note "I once heard that hypervisors are the living proof of operating system's incompetence". Glauber acknowledged that hypervisors have indeed provided a remedy for certain deficiencies in operating system design. But the goal of his talk was to point out that, for some cases, containers may be an even better remedy for those deficiencies.

Operating systems and their limitations

Because he wanted to illustrate the limitations of traditional UNIX systems that hypervisors and containers have been used to address, Glauber commenced with a recap of some operating system basics.

In the early days of computing, a computer ran only a single program. The problem with that mode of operation is that valuable CPU time was wasted when the program was blocked because of I/O. So, Glauber noted "whatever equivalent of Ingo Molnar existed back then wrote a scheduler" in order that the CPU could be shared among processes; thus, CPU cycles were no longer wasted when one process blocked on I/O.

A later step in the evolution of operating systems was the addition of virtual memory, so that (physical) memory could be more efficiently allocated to processes and each process could operate under the illusion that it had an isolated address space.

However, nowadays we can see that the CPU scheduling and virtual memory abstractions have limitations. For example, suppose you start a browser or another program that uses a lot of memory. As a consequence, the operating system will likely start paging out memory from processes. However, because the operating system makes memory-management decisions at a global scope, typically employing a least recently used (LRU) algorithm, it can easily happen that excessive memory use by one process will cause another process to suffer being paged out.

There is an analogous problem with CPU scheduling. The kernel allocates CPU cycles globally across all processes on the system. Processes tend to use as much CPU as they can. There are mechanisms to influence or limit CPU usage, such as setting the nice value of a process to give it a relatively greater or lesser share of the CPU. But these tools are rather blunt. The problem is that while it is possible to control the priority of individual processes, modern applications employ groups of processes to perform tasks. Thus, an application that creates more processes will receive a greater share of the CPU. In theory, it might be possible to address that problem by dynamically adjusting process priorities, but in practice this is too difficult, since processes may come and go quite quickly.

The other side of the resource-allocation problem is denial-of-service attacks. With traditional UNIX systems, local denial-of-service attacks are relatively easy to perpetrate. As a first example, Glauber gave the following small script:

    $ while true; do mkdir x; cd x; done

This script will create a directory structure that is as deep as possible. Each subdirectory "x" will create a dentry (directory entry) that is pinned in non-reclaimable kernel memory. Such a script can potentially consume all available memory before filesystem quotas or other filesystem limits kick in, and, as a consequence, other processes will not receive service from the kernel because kernel memory has been exhausted. (One can monitor the amount of kernel memory being consumed by the above script via the dentry entry in /proc/slabinfo.)
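
One quick way to watch that growth while the loop runs (a sketch; reading /proc/slabinfo generally requires root, and the cache may be listed as dentry or dentry_cache depending on the kernel) is:

    # watch -n1 'grep ^dentry /proc/slabinfo'

The slabtop(1) utility presents the same counters interactively.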

Fork bombs create a similar kind of problem that affects unrelated processes on the system. As Glauber noted, when an application abuses system resources in these ways, then it should be the application's problem, rather than being everyone's problem.

Hypervisors

Hypervisors have been the traditional solution to the sorts of problems described above; they provide the resource isolation that is necessary to prevent those problems.

By way of an example of a hypervisor, Glauber chose KVM. Under KVM, the Linux kernel is itself the hypervisor. That makes sense, Glauber said, because all of the resource isolation that should be done by the hypervisor is already done by the operating system. The hypervisor has a scheduler, as does the kernel. So the idea of KVM is to simply re-use the Linux kernel's scheduler to schedule virtual machines. The hypervisor has to manage memory, as does the kernel, and so on; everything that a hypervisor does is also part of the kernel's duties.

There are many use cases for hypervisors. One is simple resource isolation, so that, for example, one can run a web server and a mail server on the same physical machine without having them interfere with one another. Another use case is to gather accurate service statistics. Thus, for example, the system manager may want to run top in order to obtain statistics about the mail server without seeing the effect of a database server on the same physical machine; placing the two servers in separate virtual machines allows such independent statistics gathering.

Hypervisors can be useful in conjunction with network applications. Since each virtual machine has its own IP address and port number space, it is possible, for example, to run two different web servers that each use port 80 inside different virtual machines. Hypervisors can also be used to provide root privilege to a user on one particular virtual machine. That user can then do anything they want on that virtual machine, without any danger of damaging the host system.

Finally, hypervisors can be used to run different versions of Linux on the same system, or even to run different operating systems (e.g., Linux and Windows) on the same physical machine.

Containers

Glauber noted that all of the above use cases can be handled by hypervisors. But, what about containers? Hypervisors handle these use cases by running multiple kernel instances. But, he asked, shouldn't it be possible for a single kernel to satisfy many of these use cases? After all, the operating system was originally designed to solve resource-isolation problems. Why can't it go further and solve these other problems as well by providing the required isolation?

From a theoretical perspective, Glauber asked, should it be possible for the operating system to ensure that excessive resource usage by one group of processes doesn't interfere with another group of processes? Should it be possible for a single kernel to provide resource-usage statistics for a logical group of processes? Likewise, should the kernel be able to allow multiple processes to transparently use port 80? Glauber noted that all of these things should be possible; there's no theoretical reason why an operating system couldn't support all of these resource-isolation use cases. It's simply that, historically, operating systems were not built with these requirements in mind. The only notable use case above that couldn't be satisfied is for a single kernel to run a different kernel or operating system.

The goal of containers is, of course, to add the missing pieces that allow a kernel to support all of the resource-isolation use cases, without the overhead and complexity of running multiple kernel instances. Over time, various patches have been made to the kernel to add support for isolation of various types of resources; further patches are planned to complete that work. Glauber noted that although all of those kernel changes were made with the goal of supporting containers, a number of other interesting uses had already been found (some of these were touched on later in the talk).

Glauber then looked at some examples of the various resource-isolation features ("namespaces") that have been added to the kernel. Glauber's first example was network namespaces. A network namespace provides a private view of the network for a group of processes. The namespace includes private network devices and IP addresses, so that each group of processes has its own port number space. Network namespaces also make packet filtering easier, since each group of processes has its own network device.
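
As a rough sketch of what that looks like in practice (assuming an iproute2 recent enough to provide ip netns; the namespace name ns1 and the 10.0.0.x addresses are arbitrary):

    # ip netns add ns1                               # create a new network namespace
    # ip link add veth0 type veth peer name veth1    # create a connected pair of virtual devices
    # ip link set veth1 netns ns1                    # move one end into the namespace
    # ip addr add 10.0.0.1/24 dev veth0
    # ip link set veth0 up
    # ip netns exec ns1 ip addr add 10.0.0.2/24 dev veth1
    # ip netns exec ns1 ip link set veth1 up
    # ip netns exec ns1 ping -c 1 10.0.0.1           # the namespace sees only its own devices

Anything started under "ip netns exec ns1" sees only the namespace's own devices, addresses, and port number space.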

Mount namespaces were one of the earliest namespaces added to the kernel. The idea is that a group of processes should see an isolated view of the filesystem. Before mount namespaces existed, some degree of isolation was provided by the chroot() system call, which could be used to limit a process (and its children) to a part of the filesystem hierarchy. However, the chroot() system call did not change the fact that the hierarchical relationship of the mounts in the filesystem was global to all processes. By contrast, mount namespaces allow different groups of processes to see different filesystem hierarchies.
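
A minimal way to see the difference (a sketch using the unshare(1) tool from util-linux, which also comes up later in the talk; run as root):

    # unshare --mount /bin/bash     # start a shell with a private copy of the mount tree
    # mount -t tmpfs none /mnt      # this mount is visible only inside the new namespace
    # ls /mnt                       # shows the empty tmpfs
    # exit
    # ls /mnt                       # the original namespace never saw the mount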

User namespaces provide isolation of the "user ID" resource. Thus, it is possible to create users that are visible only within a container. Most notably, user namespaces allow a container to have a user that has root privileges for operations inside the container without being privileged on the system as a whole. (There are various other namespaces in addition to those that Glauber discussed, such as the PID, UTS, and IPC namespaces. One or two of those namespaces were also mentioned later in the talk.)
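
On a system where user namespaces are complete enough (which, as noted below, mainline was not at the time of the talk), and with a util-linux new enough to provide --map-root-user, the effect looks something like this sketch:

    $ unshare --user --map-root-user /bin/bash   # run by an ordinary, unprivileged user
    # id -u                                      # reports 0: root inside the new namespace
    # touch /etc/shadow                          # ...but no extra power over the host's files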

Control groups (cgroups) provide the other piece of infrastructure needed to implement containers. Glauber noted that cgroups have received a rather negative response from some kernel developers, but he thinks that somewhat misses the point: cgroups have some clear benefits.

A cgroup is a logical grouping of processes that can be used for resource management in the kernel. Once a cgroup has been created, processes can be migrated in and out of the cgroup via a pseudo-filesystem API (details can be found in the kernel source file Documentation/cgroups/cgroups.txt).
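
Concretely, that interface amounts to mounting the cgroup pseudo-filesystem, creating a directory, and writing a process ID into its tasks file. A sketch, assuming the cpu controller is not already mounted somewhere (the mount point and the group name mygroup are arbitrary):

    # mkdir -p /sys/fs/cgroup/cpu
    # mount -t cgroup -o cpu none /sys/fs/cgroup/cpu
    # mkdir /sys/fs/cgroup/cpu/mygroup            # create a new cgroup
    # echo $$ > /sys/fs/cgroup/cpu/mygroup/tasks  # migrate the current shell into it
    # cat /sys/fs/cgroup/cpu/mygroup/tasks        # list the group's member processes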

Resource usage within cgroups is managed by attaching controllers to a cgroup. Glauber briefly looked at two of these controllers.

The CPU controller mechanism allows a system manager to control the percentage of CPU time given to a cgroup. The CPU controller can be used both to ensure that a cgroup gets a guaranteed minimum percentage of the system's CPU, regardless of other load on the system, and to set an upper limit on the amount of CPU time used by a cgroup, so that a rogue process can't consume all of the available CPU time. CPU scheduling is done first at the cgroup level, and then across the processes within each cgroup. As with some other controllers, CPU cgroups can be nested, so that the percentage of CPU time allocated to a top-level cgroup can be further subdivided across cgroups under that top-level cgroup.
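
Continuing the hypothetical mygroup above, the relevant knobs look like this (a sketch; cpu.shares sets the proportional weight under contention, while the quota/period pair caps absolute usage):

    # echo 2048   > /sys/fs/cgroup/cpu/mygroup/cpu.shares         # twice the default weight of 1024
    # echo 100000 > /sys/fs/cgroup/cpu/mygroup/cpu.cfs_period_us  # 100ms accounting period
    # echo 50000  > /sys/fs/cgroup/cpu/mygroup/cpu.cfs_quota_us   # at most half a CPU per period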

The memory controller mechanism can be used to limit the amount of memory that a process uses. If a rogue process runs over the limit set by the controller, the kernel will page out that process, rather than some other process on the system.
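
A corresponding sketch for the memory controller (again assuming the controller is not already mounted; the 512M figure is chosen purely for illustration):

    # mkdir -p /sys/fs/cgroup/memory
    # mount -t cgroup -o memory none /sys/fs/cgroup/memory
    # mkdir /sys/fs/cgroup/memory/mygroup
    # echo 512M > /sys/fs/cgroup/memory/mygroup/memory.limit_in_bytes  # hard limit for the group
    # echo $$   > /sys/fs/cgroup/memory/mygroup/tasks                  # confine the shell and its children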

The current status of containers

It is possible to run production containers today, Glauber said, but not with the mainline kernel. Instead, one can use the modified kernel provided by the open source OpenVZ project that is supported by Parallels, the company where Glauber is employed. Over the years, the OpenVZ project has been working on upstreaming all of its changes to the mainline kernel. By now, much of that work has been done, but some still remains. Glauber hopes that within a couple of years ("I would love to say months, but let's get realistic") it should be possible to run a full container solution on the mainline kernel.

But, by now, it is already possible to run subsets of container functionality on the mainline kernel, so that some people's use cases can already be satisfied. For example, if you are interested in just CPU isolation, in order to limit the amount of CPU time used by a group of processes, that is already possible. Likewise, the network namespace is stable and well tested, and can be used to provide network isolation.

However, Glauber said, some parts of the container infrastructure are still incomplete or need more testing. For example, fully functional user namespaces are quite difficult to implement. The current implementation is usable, but not yet complete, and consequently there are some limitations to its usage. Mount and PID namespaces are usable, but likewise still have some limitations. For example, it is not yet possible to migrate a process into an existing instance of either of those namespaces; that is a desirable feature for some applications.

Glauber noted some of the kernel changes that are yet to be merged to complete the container implementation. Kernel memory accounting is not yet merged; that feature is necessary to prevent exploits (such as the dentry example above) that consume excessive kernel memory. Patches to allow kernel-memory shrinkers to operate at the level of cgroups are still to be merged. Filesystem quotas that operate at the level of cgroups remain to be implemented; thus, it is not yet possible to specify quota limits on a particular user inside a user namespace.

There is already a wide range of tooling in place that makes use of container infrastructure, Glauber said. For example, the libvirt library makes it possible to start up an application in a container. The OpenVZ vzctl tool is used to manage full OpenVZ containers. It allows for rather sophisticated management of containers, so that it is possible to do things such as running containers using different Linux distributions on top of the same kernel. And "love it or hate it, systemd uses a lot of the infrastructure". The unshare command can be used to run a command in a separate namespace. Thus, for example, it is possible to fire up a program that operates in an independent mount namespace.

Glauber's overall point is that containers can already be used to satisfy several of the use cases that have historically been served by hypervisors, with the advantages that containers don't require the creation of separate full-blown virtual machines and provide much finer granularity when controlling what is or is not shared between the processes inside the container and those outside the container. After many years of work, there is by now a lot of container infrastructure that is already useful. One can only hope that Glauber's "realistic" estimate of two years to complete the upstreaming of the remaining container patches proves accurate, so that complete container solutions can at last be run on top of the mainline kernel.

Index entries for this article
Kernel: Containers
Kernel: Virtualization/Containers
Conference: LinuxCon Europe/2012



LCE: The failure of operating systems and how we can fix it

Posted Nov 15, 2012 7:50 UTC (Thu) by epa (subscriber, #39769) [Link] (8 responses)

Glauber Costa is quite right that hypervisors are a response to the failure of the operating system to provide enough isolation between different processes or different users on the same machine. But there is another important way in which the operating system has failed, and that is in providing an interface which is wide enough to be usable and yet narrow enough to be completely specified and dependable.

Often a large application will specify a particular Linux distribution or Windows version it is 'certified' to run on. The vendor may even insist that its application be the only thing running on the machine, if you want to get support. It may require particular versions of system libraries because those were the ones it was tested with. And yes, I am talking about big companies here, where stupid things are done for stupid big-organization reasons, and if you use free software and compile from source you are free of this nonsense, blah blah. But bear with me and assume that at least some of the time there is a legitimate reason to require an exact operating system version for running an application. (If you have ever worked on a support desk, you will find this reality easier to accept.)

So what we start to see are 'appliances' where the application is packaged up with its operating system ready to load into a virtual machine. Instead of supplying a program which calls the complex interface provided by the kernel, C library, and other system libraries, the vendor supplies one which expects the 'ABI' of an idealized x86-compatible computer. It has proved easier to agree on that than to agree on the higher level interfaces. Even though, somewhat absurdly, it means that TCP/IP and filesystems and virtual memory are all being reimplemented inside the 'appliance', it works out more robust this way.

LCE: The failure of operating systems and how we can fix it

Posted Nov 15, 2012 9:55 UTC (Thu) by robert_s (subscriber, #42402) [Link] (4 responses)

"it works out more robust this way."

Only because so far, little communication & cooperation between these appliances has been sought or required. If "appliances" are our new "processes" the fun is going to come when the equivalent of IPC is required.

And let's not even start talking about efficiency.

LCE: The failure of operating systems and how we can fix it

Posted Nov 15, 2012 9:58 UTC (Thu) by epa (subscriber, #39769) [Link] (1 responses)

For better or worse, inter-application communication will end up being TCP/IP rather than going via the filesystem or local IPC mechanisms.

LCE: The failure of operating systems and how we can fix it

Posted Nov 15, 2012 16:04 UTC (Thu) by k3ninho (subscriber, #50375) [Link]

TCP/IP? That's a little broad: JSON via HTTP requests on port 80 (it's left to the reader to provide an insecure implementation via HTTPS/443).

LCE: The failure of operating systems and how we can fix it

Posted Nov 22, 2012 6:08 UTC (Thu) by HelloWorld (guest, #56129) [Link] (1 responses)

What's the problem with IPC? The whole point of containers is to have better granularity: share the IPC namespace, but don't share the file system namespace so that you can use your own shared libraries.

LCE: The failure of operating systems and how we can fix it

Posted Nov 22, 2012 8:13 UTC (Thu) by Fowl (subscriber, #65667) [Link]

Shared libraries often use IPC...

Plus people want to use containers for more serious "untrusted" isolation.

LCE: The failure of operating systems and how we can fix it

Posted Nov 15, 2012 15:26 UTC (Thu) by raven667 (subscriber, #5198) [Link] (2 responses)

I think this is exactly right and is something I have noticed as well. I would also add that this is the reality of the microkernel model, so in a way Tanenbaum was right: microkernels are the future, but we call them hypervisors.

LCE: The failure of operating systems and how we can fix it

Posted Nov 15, 2012 18:01 UTC (Thu) by drag (guest, #31333) [Link] (1 responses)

And with KVM we call hypervisors "Linux"

LCE: The failure of operating systems and how we can fix it

Posted Nov 15, 2012 20:09 UTC (Thu) by glommer (guest, #15592) [Link]

Which is a very big advantage of KVM, and why I like it so much and worked on it for so long.

But while you are reusing all of the infrastructure from the OS - awesome - you still have two schedulers, two virtual memory subsystems, two IO dispatchers, etc.

Containers, OTOH, are basically the operating system taking resource isolation one step further, allowing you to do all of that without resorting to the resource duplication you have with hypervisors - whether your hypervisor is your own OS or not.

Which of them suits you better is up to you, your use cases, and your personal preferences.

Conflicting goals?

Posted Nov 15, 2012 15:13 UTC (Thu) by NAR (subscriber, #1313) [Link] (4 responses)

It may be wise to constrain a (group of) processes to, e.g., 80% of RAM and 80% of CPU - but what if there are no other processes on the computer? Then 20% of the RAM and 20% of the CPU go unused. For example, when I'm playing a game, I'd like it to use every bit of resource to get that 90 frames-per-second rate. From the outside, how does that differ from a fork bomb?

Conflicting goals?

Posted Nov 15, 2012 16:04 UTC (Thu) by Jonno (subscriber, #49613) [Link] (2 responses)

The memory controller allows for both soft and hard limits, so you can set a soft limit of 80% (so that if the cgroup is above 80%, its memory will be the first to go) and a hard limit of 100% (so that the cgroup gets to use all the memory no one else wants).

The CPU controller works slightly differently: you can set a "shares" value, and *when CPU contention occurs* the CPU resources will be assigned proportionally. As the root cgroup defaults to 1024 shares, if you assign a shares value of 4096 to your cgroup (assuming there are no other cgroups), it will be limited to 80% of CPU time when there is contention, but be allowed to use more if no other process wants to be scheduled.

Conflicting goals?

Posted Nov 15, 2012 20:01 UTC (Thu) by glommer (guest, #15592) [Link] (1 responses)

Your description is precise and correct.

However, there are many scenarios where you actually want to limit the maximum amount of CPU used, even without contention. An example of this is cloud deployments, where you pay for CPU time and value price predictability over performance.

The cpu cgroup *also* allows one to set a maximum quota through the combination of the following knobs:

cpu.cfs_quota_us
cpu.cfs_period_us

If you define your quota as 50% of your period, you will run for at most 50% of the time. This is bandwidth based, in units of microseconds, so "use at most 2 CPUs" is equivalent to 200%; IOW, 2 seconds per second.
The quota defaults to -1, meaning "no cap".

Equivalent mechanism exists for rt tasks: cpu.rt_quota_us, etc.

Cheers

Conflicting goals?

Posted Nov 19, 2012 3:14 UTC (Mon) by mathstuf (subscriber, #69389) [Link]

How does/would the 200% percentage work with big.LITTLE?

Conflicting goals?

Posted Nov 15, 2012 16:23 UTC (Thu) by ottuzzi (guest, #74496) [Link]

Hi,

Solaris containers respect the imposed limits only when resources are contended.
Say you have container A with a CPU limit of 80 and container B with a CPU limit of 70: if container A is not using the CPU, container B can use all the CPU available; likewise, if B is not doing anything with the CPU, A can use all of it.
But when both A and B want all the available CPU, they are balanced proportionally to their weights.
I think Linux implements it the same way.

Hope I was clear
Bye
Piero

LXC?

Posted Nov 15, 2012 18:55 UTC (Thu) by pbryan (guest, #3438) [Link] (4 responses)

> It is possible to run production containers today, Glauber said, but not with the mainline kernel. Instead, one can use the modified kernel provided by the open source OpenVZ project that is supported by Parallels, the company where Glauber is employed.

I was under the impression that it is possible to run production containers today with LXC. What functionality does OpenVZ provide that is not supported by LXC, i.e., via cgroups, clone(2) isolation flags, and the devpts isolation mechanisms?

LXC?

Posted Nov 15, 2012 19:01 UTC (Thu) by xxiao (guest, #9631) [Link] (3 responses)

I also thought the same (i.e., use LXC), until the sales pitch on OpenVZ came out...

LXC?

Posted Nov 15, 2012 19:48 UTC (Thu) by glommer (guest, #15592) [Link] (2 responses)

You can use LXC to run containers on Linux, but whether you can go to "production" with it, depends on what "production" means to you.

There are many things that mainline Linux lacks. One of them is the kernel memory limitation described in the article, which allows the host to protect itself against abuse from potentially malicious containers. It is trivial for a container to fill memory with non-reclaimable objects so that no one else can be serviced.

User namespaces are progressing rapidly, but they are not there yet. Eric Biederman is doing a great job with that, patches are flowing rapidly, but you still lack a fully isolated capability system.

The pseudo file-systems /proc and /sys will still leak a lot of information from the host.

Tools like "top" won't work, because it is impossible to grab per-group figures of cpu usage. And this is not an extensive list.

So if "production" for you rely on any of the above, then no, you can't run LXC. If otherwise, then sure, you can run LXC.

Besides that, a lot of the kernel features that LXC relies on were contributed by the OpenVZ project. So it is not like we're trying to fork the kernel and keep people on our branch forever. It's just quite a big amount of work, the trade-offs are not always clear to upstream, and so on - it is no different than Android in essence.

The ultimate goal, as stated in the article, is to have all the kernel functionality in mainline, so people can use any userspace tool they want.

Cheers

LXC?

Posted Nov 16, 2012 12:29 UTC (Fri) by TRS-80 (guest, #1804) [Link] (1 responses)

Having decent userspace tools is something else that's missing from the upstream kernel container implementation. The kernel has all these features now, but no coherent way of managing them nicely yet.

LXC?

Posted Nov 22, 2012 15:12 UTC (Thu) by mathstuf (subscriber, #69389) [Link]

Yeah, something like ezjail-admin would be nice for LXC. It'd make me consider using CentOS or Debian for my server instead of FreeBSD.

LCE: The failure of operating systems and how we can fix it

Posted Nov 15, 2012 21:31 UTC (Thu) by naptastic (guest, #60139) [Link] (2 responses)

Parallels is the company responsible for the travesty of horrors that is Plesk.

The hosting company I work for (who shall remain nameless) uses OpenVZ for virtual servers. It is wholly inadequate and we have begun transitioning to KVM.

It's still trivial to forkbomb a container running the VZ kernel and take down the whole host. The container model also requires a number of really, really annoying limits (shared memory, locked memory, open files, total files, TCP sockets, total sockets, and on and on) that have to be there because of the fundamental weakness of the container model.

You can get around that by using a container just for one thing, but then you still have to have a full operating system just for that one thing. If I want to have memcache by itself in a container, I need a full filesystem and Linux install to support it. You lose some overhead by using a container instead of a hypervisor, then get it all right back, plus some, with the requirements of containment.

The only advantage I can see is that you can update the hardware configuration in realtime. Other than that, use cgroups or use full virtualization.

LCE: The failure of operating systems and how we can fix it

Posted Nov 16, 2012 9:11 UTC (Fri) by kolyshkin (guest, #34342) [Link] (1 responses)

I am really sorry for your negative experience with that travesty of horrors. Nevertheless, that product has no common roots (or common developers, managers, etc.) with OpenVZ, so this is hardly relevant.

> It's still trivial to forkbomb a container running
> the VZ kernel and take down the whole host.

If that were true, all the hosting service providers using VZ (i.e., the majority of them) would have gone out of business very soon. If CT resources (user beancounters) are configured in a sane way (and they are configured that way by default), then unless the host system administrator removes some very vital limits, this is totally and utterly impossible.

So, let's turn words into actions. I offer you $100 (one hundred US dollars) for demonstrating a way to bring the whole system down using a fork bomb in OpenVZ (or Virtuozzo, for that matter) container. A reproducible description of what to do to achieve it is sufficient.

> The container model also requires a number of really, really annoying limits

I can feel your pain. Have you ever heard of vswap? Here is some year-old news for you:
- https://plus.google.com/113376330521944789537/posts/5WEzA...
- http://wiki.openvz.org/VSwap

In a nutshell, you only need to set RAM and SWAP for a container, and can keep the rest of the really, really annoying limits unconfigured.

> because of the fundamental weakness of the container model

Could you please enlighten us as to what exactly this fundamental weakness is?

> You can get around that

Now, I won't be commenting on the rest of your entry because it is based on wrong assumptions.

LCE: The failure of operating systems and how we can fix it

Posted Nov 20, 2012 1:40 UTC (Tue) by exel (guest, #87380) [Link]

I think a lot of people still associate OpenVZ with Plesk/Virtuozzo. I must admit I haven't seen where that stuff has been going, but 6 years ago it was all pretty horrible.

There is, at the core of OpenVZ, something that seems absolutely elegant and useful – a more isolating take on the FreeBSD jail concept, which itself was of course chroot on acid. Using it for advanced process isolation looks like a sensible application. The _typical_ application of OpenVZ, in the field right now, though, is that of a poor man's hypervisor. I think that a container technology is just the wrong approach for this.

The big elephant in the room, for me, is security isolation. Containers all run under the same kernel, which means that a kernel compromise is a compromise of all attached security domains. An actual hypervisor setup adds an extra privilege layer that has to be separately broken.

Again, this doesn't mean that OpenVZ cannot be tremendously useful. The most visible way Parallels is selling the technology, however, is not what people are looking for. This pans out in the market place.

LCE: The failure of operating systems and how we can fix it

Posted Nov 16, 2012 17:22 UTC (Fri) by tbird20d (subscriber, #1901) [Link] (2 responses)

At the risk of exposing my ignorance, the containers approach seems like it's piling complexity on top of complexity. I view it somewhat akin to manual loop-unrolling. Sure, you can get some good performance benefits, and sometimes it's called for, but it makes the code more difficult to understand and is harder to maintain.

If the kernel is lightweight, then re-using it in a recursive sort of way as a hypervisor, à la KVM, seems like the more tractable long-term approach, rather than adding lots of complexity to all these different code paths (basically, almost all of the major resource-management paths in the kernel).

LCE: The failure of operating systems and how we can fix it

Posted Nov 16, 2012 18:27 UTC (Fri) by dlang (guest, #313) [Link] (1 responses)

The problem is that the kernel is not lightweight.

When you run a separate kernel to manage one app, you now have multiple layers of caching, for example.

'Fixing' this very quickly gets to the point where it's at least as much complexity.

LCE: The failure of operating systems and how we can fix it

Posted Nov 17, 2012 16:47 UTC (Sat) by tom.prince (guest, #70680) [Link]

There are a number of options to have light-weight bare-metal applications running in the VM, rather than full operating systems.

http://www.openmirage.org/ and https://github.com/GaloisInc/HaLVM are tools for doing this that come to mind.

LCE: The failure of operating systems and how we can fix it

Posted Nov 22, 2012 9:15 UTC (Thu) by hensema (guest, #980) [Link] (1 responses)

Costa forgets important uses of virtualisation:

  • Isolation of customers. Each customer can be king of his virtual machine. You'll never achieve this in a single OS instance.
  • Migration of VMs. You're not dependent on the hardware you've started the instance on. A VM can be migrated to faster/newer/more stable/cheaper/whatever hardware on-the-fly. It's common for VMs to have higher uptime than any physical hardware in the data centre.
  • Easy provisioning. Sure, you can do a PXE boot and auto-install an OS. But it'll never be as easy and flexible as provisioning a VM.

Surely there's a lot to be won in process or user isolation in Linux itself. That'll be useful both when running on bare metal and on a hypervisor. However, Costa seems to want to go back to the heyday of mainframes and minis. That's not going to happen. Sorry.

LCE: The failure of operating systems and how we can fix it

Posted Nov 22, 2012 9:50 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

>Isolation

That is done acceptably by OpenVZ and Containers.

>Migration of VMs

Ditto. There's an article about checkpoint/restore for Linux containers in this week's LWN issue.

>Easy provisioning.

OpenVZ actually wins here. Provisioning an OpenVZ container is dead easy. Mass operations are also very easy.

LCE: The failure of operating systems and how we can fix it

Posted Nov 24, 2012 14:46 UTC (Sat) by jch (guest, #51929) [Link] (9 responses)

> Likewise, should the kernel be able to allow multiple processes to transparently use port 80?

That's pretty much orthogonal to resource usage, isn't it?

The issue here is that IP addresses don't obey the usual permissions model: I can chown a directory, thereby giving a user the right to create files and subdirectories within this particular directory, but I cannot chown an IP address, thereby giving a user the right to bind port 80 on this particular address.

I'd be curious to know if I'm the only person feeling that many of the uses of containers and virtualisation would be avoided if the administrator could chown an IP address (or an interface).

-- jch

LCE: The failure of operating systems and how we can fix it

Posted Nov 24, 2012 18:07 UTC (Sat) by dlang (guest, #313) [Link] (8 responses)

You can make it so that a normal user can bind to port 80; you can also set up iptables rules so that packets only flow on port 80 (or port 80 at a specific IP address) to and from processes running as a particular user.

It's not as trivial as a chown, but it's possible.
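
For the capability route, a common sketch (assuming a filesystem with extended-attribute support and a real binary rather than an interpreted script; the daemon path here is hypothetical):

    # setcap 'cap_net_bind_service=+ep' /usr/local/bin/mydaemon   # grant only the low-port bind capability
    $ /usr/local/bin/mydaemon                                     # can now bind port 80 without being root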

I think you are mixing up the purposes of using containers.

It's not the need to use port 80 that causes you to use containers; it's that so many processes that use port 80 don't play well with each other (very specific version dependencies that conflict) or are not trusted to be well written, so you want to limit the damage they can do if they run amok (either due to local bugs, or due to external attackers).

containers don't give you as much isolation as full virtualization, but they do give you a lot of isolation (and are improving fairly rapidly), and they do so at a fraction of the overhead (both CPU and memory) of full virtualization.

If you have a very heavy process you are running, you may not notice the overhead of the virtualization, but if you have fairly lightweight processes you are running, the overhead can be very significant.

I'm not just talking about the CPU hit for running in a VM, or the memory hit from each VM having its own kernel, but also things like the hit from each VM doing its own memory management, the hit (both CPU and memory) from each VM needing to run its own copy of all the basic daemons (systemd/init, syslog, dbus, udev, etc.), and so on.

If you are running single digit numbers of VMs on one system, you probably don't care about these overheads, but if you are running dozens to hundreds of VMs on one system, these overheads become very significant.

LCE: The failure of operating systems and how we can fix it

Posted Nov 24, 2012 18:59 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link] (7 responses)

> You can make it so that a normal user can bind to port 80; you can also set up iptables rules so that packets only flow on port 80 (or port 80 at a specific IP address) to and from processes running as a particular user.
> It's not as trivial as a chown, but it's possible.
How? So far I have tried:
1) Iptables - simply DoesNotWork(tm), particularly for localhost.
2) Redirectors - PITA to setup and often no IPv6 support.
3) Capabilities - no way to make it work with Python scripts or Java apps.

For now I'm using nginx as a full-scale HTTP proxy.

That restriction for <1024 ports is by far the most moronic stupid imbecilic UNIX feature ever invented.

LCE: The failure of operating systems and how we can fix it

Posted Nov 24, 2012 19:09 UTC (Sat) by bronson (subscriber, #4806) [Link]

It made sense in the '80s and '90s, when the typical Unix host was serving tens to thousands of people and root tended to be trustworthy. No sysadmin wants his users competing to be the first to bind to port 79.

It's true that those days are long gone and it's time for this restriction to disappear.

LCE: The failure of operating systems and how we can fix it

Posted Nov 24, 2012 19:13 UTC (Sat) by dlang (guest, #313) [Link] (5 responses)

I don't have the time to look it up right now, but I remember seeing a /proc entry that removed the <1024 limit.

I agree that in the modern Internet, that really doesn't make sense, but going back, you had trusted admins (not just of your local box, but of the other boxes you were talking to), and in that environment it worked.

so think naive not moronic

remember, these are the same people who think that firewalls are evil because they break the unlimited end-to-end connectivity of the Internet. :-)

LCE: The failure of operating systems and how we can fix it

Posted Nov 24, 2012 19:21 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link] (4 responses)

>I don't have the time to look it up right now, but I remember seeing a /proc entry that removed the <1024 limit.
There is no such entry (I naively thought so too). You can check the Linux source.

>I agree that in the modern Internet, that really doesn't make sense, but going back, you had trusted admins (not just of your local box, but of the other boxes you were talking to), and in that environment it worked.
A good mechanism would have been to allow users access to a range of ports. Something simple like /etc/porttab, with a list of port ranges and associated groups, would suffice.

>remember, these are the same people who think that firewalls are evil because they break the unlimited end-to-end connectivity of the Internet. :-)
I happen to think the same. Security should not be done at the network's border; instead, all the systems should be secured by local firewalls.

LCE: The failure of operating systems and how we can fix it

Posted Nov 24, 2012 19:26 UTC (Sat) by dlang (guest, #313) [Link] (2 responses)

>> I don't have the time to look it up right now, but I remember seeing a /proc entry that removed the <1024 limit.

> There is no such record (I naively thought so too). You can check the Linux source.

Ok, I thought I remembered seeing it at some point in the past, I may have mixed it up with the ability to bind to IP addresses that aren't on the box <shrug>

I wonder how quickly someone could whip up a patch to add this ;-)

seriously, has this been discussed and rejected, or has nobody bothered to try and submit something like this?

LCE: The failure of operating systems and how we can fix it

Posted Nov 24, 2012 19:58 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link]

Yes, I remember that such a patch was proposed way back then and rejected. And now it will be further complicated by cgroups/namespaces/whatever.

LCE: The failure of operating systems and how we can fix it

Posted Jun 29, 2014 8:50 UTC (Sun) by stevenp129 (guest, #97662) [Link]

What about the fact that if a user could bind to a low-level port, they could take advantage of race conditions and put up a web server (or proxy) in place of the intended appliance?

If user BOB wrote a program to constantly monitor Apache and, the second its PID dies, fire up his own web server on port 80, he could steal sensitive information and passwords with great ease.

On a shared hosting service (for example), if somebody neglected to update their CMS to the latest version, and the host runs their webserver without a chroot... a simple bug or exploit in a website could, in turn, allow a rogue PHP or CGI script to take over the entire server! Not good!

Or imagine your DNS server going down due to a hostile takeover... they could redirect traffic to their own off-site server and perform phishing attacks against you and all your clients this way!

Of course there are legitimate reasons to forbid those without privileges from binding to ports less than 1024... I'm not sure what is so "stupid" about this idea.

LCE: The failure of operating systems and how we can fix it

Posted Jan 10, 2013 12:03 UTC (Thu) by dps (guest, #5725) [Link]

Putting my system admin hat on, I want both border *and* host security. There is a lot that makes sense to block at the border because outsiders have no business using it. Servers on the safe side of firewalls often have to have more services configured and are therefore less secure.

If a border firewall blocks some attack traffic, then a security bug on an internal system is not immediately fatal and there is time to fix it before the border firewall's security is breached. If that has not happened, it implies either that nobody worthwhile has tried or that you can't detect security breaches.

In an ideal world there would be no need for security, because nobody would even think of doing a bad deed. The world has never been that way.

LCE: The failure of operating systems and how we can fix it

Posted Aug 4, 2013 3:44 UTC (Sun) by rajneesh (guest, #92204) [Link]

How do you give a minimum guarantee using cgroups? cpu.cfs_quota_us is just an upper limit on the CPU usage of the process using bandwidth control. Can you please explain how you would achieve a minimum guarantee of CPU cycles?


Copyright © 2012, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds