This is going to be a longer blog entry, but here’s a TL;DR:
I propose that instead of "immutable" or "read-only" when talking about operating systems (such as Fedora CoreOS, Google COOS, Flatcar etc.), we use these terms:
- "fully managed": The system does not have "unmanaged state" – e.g. an admin interactively doing ssh and making changes not recorded declaratively somewhere else
- "image based": Traditional package managers end up with a lot of "hidden state" (related to above); image based updates avoid that
- "reprovisionable" and not a "pet": I don’t like the industry "pets vs cattle" term, and I think "reprovisionable" is both nicer and more descriptive
- "Has anti-hysteresis properties": (Yes I know this is an awkward term) See https://en.wikipedia.org/wiki/Hysteresis – I’ll talk more about this later
(Terminology note: In this article I will also use the abbreviation "pkgmgrs" for "traditional package managers like apt/yum". Systems like NixOS and some aspects of swupd from Clear Linux improve parts of what I’m talking about, but this article is already really long, and a detailed comparison including those really deserves a separate post.)
Why not "immutable"/"read-only"?
Because it’s very misleading. These systems as a whole are not immutable, or read-only, or stateless – there are writable, persistent data areas. And more importantly, those writable data areas allow persistently storing privileged code. They have to, because these OSes need to support:
- the user being root on their own computer
- In place OS updates
(What about systems that don’t support "in place" updates? Yes, there are people/organizations who e.g. build a new cloud image for every change, and often don’t even enable ssh or any persistent writable state for the OS. This is fine, but one problem is that it doesn’t generally apply outside of cloud/IaaS environments, e.g. on bare metal machines, and it can make upgrades for small changes disproportionately expensive.)
Back to operating systems with state that some people call "immutable":
But /usr is read-only!
Yes. And this does have some security benefits, e.g. this runc vulnerability isn’t exploitable.
But in order for the operating system to be updated in place, there must be some writable area to add new OS content – so it’s not immutable. The details of this vary; a number of "image based" operating systems use dual partitions, while OSTree is based on hardlinking with a "hidden" writable data store.
The real reason to have a read-only /usr is to make clear that the content of that directory (the operating system binaries) is "fully managed" or "owned" by the OS creator – you shouldn’t try to overwrite or replace parts of it because those changes could be overwritten by a future update.
And this "changes in /usr being overwritten" is a real existing problem with traditional package-manager systems (pkgmgrs). For example, a while ago I was looking at Keylime and came across this bit in the installer. That change would be silently overwritten by the next yum/apt update, so the system administrator experience would be:
- Provision system
- Install things (including keylime)
- ⌛ Time passes
- Apply OS updates (not on by default), then keylime breaks for a non-obvious reason
The more correct thing instead would be for that playbook to write a systemd drop-in in /etc to override just ExecStart=, although even doing that is fragile, and what’d be best here is to make this an explicitly configurable option for tpm2-abrmd in a config file in /etc.
The overall point is that the reason /usr is read-only is primarily to enforce that user configuration is cleanly separate from the OS content – which becomes particularly important when OS updates are automatic by default, as they are in Fedora CoreOS.
I think having automatic updates on by default fundamentally changes the perception of responsibility around updates; if I’m a system administrator and I typed apt/yum update and things broke, it’s my fault, but if automatic updates are on by default and I’m doing something else and the machine just falls over – it’s the OS vendor’s fault. Linking these two together: Since Fedora CoreOS has automatic updates on, we really need to be clear what’s our responsibility and what’s yours.
Now, this isn’t a new problem, and most people maintaining systems know not to do the kinds of things that Keylime Ansible playbook is doing. But it’s an extremely easy mistake to make without strong discipline when /usr is sitting there writable by any process that runs as root. I’ve seen many, many examples of this.
Nothing actually stops traditional package managers from mounting /usr read-only by default – they could do the equivalent of unshare -m /bin/sh -c 'mount -o remount,rw /usr && apt update' internally. But the challenges there grow into adjusting the rest of the filesystem layout to handle a read-only /usr, such as how OSTree suggests moving /usr/local to /var/usrlocal etc.
Image based updates
Usually instead of talking about an "immutable" system that allows in place updates, it’d be more useful and accurate to say "image based".
And this gets into another huge difference between traditional package managers and image based systems: The amount of "internal state".
The way most package managers work is that when you type $pkgmgr install foo, the fact that you want foo installed is recorded by adding it to the database. But the package manager database also includes a whole set of "base packages" that (usually) you didn’t choose. Those "base packages" might come from a base container image when you podman/docker pull; for cloud images they come from the default image; and for physical systems they often come from a distribution-specific default list embedded in or downloaded from the ISO or equivalent.
A problem with this model then is "drift" – if the distribution decides to add a package to the base set, you (usually) don’t get it when applying in-place updates, since most package managers just update the set of packages you already have. One solution to this is metapackages, but if not everything in the base is covered you still have drift that can be hard to notice over time.
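To make the drift concrete, here is a toy Python sketch. It is only a model under simplifying assumptions: the package names are made up, and "update" is reduced to "keep the same set of package names", which is roughly what most package managers do by default.

```python
# Toy model of package "drift"; every package name here is hypothetical.

def in_place_update(installed: set[str]) -> set[str]:
    """Upgrade what is installed. The distribution's current default
    set is never consulted, so the set of names does not change."""
    return set(installed)


def fresh_install(base_set: set[str], user_chosen: set[str]) -> set[str]:
    """A new installation starts from today's base set."""
    return base_set | user_chosen


base_at_install = {"kernel", "systemd", "openssh"}
base_today = base_at_install | {"new-firmware-tool"}  # distro added a package
user_chosen = {"tmux"}

old_machine = in_place_update(fresh_install(base_at_install, user_chosen))
new_machine = fresh_install(base_today, user_chosen)

# Same distribution version, same user choices, different package sets.
drift = new_machine - old_machine
print(sorted(drift))  # ['new-firmware-tool']
```

The two machines run the "same" OS with the same user-selected packages, yet differ only because of when they were first installed.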
I think for users of many pkgmgrs this "initial state" is hard to disentangle from things you typically do care about, like the packages you chose to install. There are e.g. the apt-mark showmanual and dnf history userinstalled commands.
And…trying that out by pulling the docker.io/debian:stable image, it claims:
# apt-mark showmanual
iproute2
iputils-ping
#
And that’s the first command I ran in the image! Clearly a bug somewhere. For the fedora:32 base image it lists a bunch of packages that correspond to the bits in the base kickstart – but that’s not something I as the user wrote.
By analogy with /usr vs /etc – this is like mixing local configuration in /usr.
This problem extends beyond the "user installed" database: traditional package managers aren’t aware of the "base bootimage", which operates on a separate infrastructure layer. apt has no idea about the OpenStack image/AMI/qcow2 or whatever that formed its initial state, nor is it aware of the OCI/docker container initial image (and conversely, e.g. podman/docker have no idea that yum/apt etc. are running inside).
So over time, the state of the system with traditional pkgmgrs is a function of many things:
- Which packages you chose to install (obviously)
- The set of packages from the initial "bootimage" or container image
- More subtle things like which packages are in the "user installed" database
- Even more subtle things can happen when weak dependencies like Recommends change in upstream packages
- The package manager version: RHEL8 yum has autoremove on by default, RHEL7 and older yum doesn’t
One solution to this type of "drift" is to not use packages at all (pure "base OS" + "apps/containers") like Google COOS, or to group things at a higher level (Clear Linux is more in this bucket).
I’m pretty happy though with the design we came up with for rpm-ostree used by Fedora CoreOS/Silverblue/IoT; there is a clear "base commit" that comes in OSTree format, and you can add packages on top – recasting RPMs as "operating system extensions" (see also this OpenShift enhancement).
For rpm-ostree it’s really simple – by default it operates in pure ostree mode, so if you don’t layer/override any packages you are exactly replicating an ostree commit – and that’s it! You don’t need to think about packages at all.
Particularly for Fedora CoreOS, there is almost nothing in the "bootimage" (ISO, AMI equivalent) that isn’t part of the ostree commit.
In other words, "state of installed software" is a function of (effectively) one thing by default: the ostree commit.
It’s even stronger than that really, it’s not just "same packages" it’s "bit for bit identical /usr filesystem". However, there is one important note: /boot does come from the bootimage, see this issue.
Bootloader aside, effectively all of the OS state you care about then does not depend on which bootimage you happened to use to install initially. When OSTree performs an update, it does not matter what the "previous" commit was – the old and new implicitly share files via the hardlink store, but updates always involve a "fresh checkout" of the new commit. Every upgrade is like a fresh OS install of that version with your configuration (/etc) and state (/var) re-applied.
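The difference between "mutate what is there" and "fresh checkout" can be sketched in a few lines of Python. This is a toy model, not how OSTree or any package manager is implemented: a "commit" is just a dictionary from path to content, and the paths and contents are invented for illustration.

```python
# Toy model: /usr is a mapping of path -> content. All names are hypothetical.

def pkgmgr_upgrade(current_usr: dict[str, str], new_files: dict[str, str]) -> dict[str, str]:
    """Traditional upgrade: write new files over the existing tree.
    Anything else already in the tree survives."""
    result = dict(current_usr)
    result.update(new_files)
    return result


def image_upgrade(current_usr: dict[str, str], new_commit: dict[str, str]) -> dict[str, str]:
    """Image-based upgrade: the previous tree is not an input at all."""
    return dict(new_commit)


v2 = {"/usr/bin/foo": "foo-2"}
machine_a = {"/usr/bin/foo": "foo-1"}
machine_b = {"/usr/bin/foo": "foo-1", "/usr/bin/hack": "local edit"}

# With in-place mutation the result depends on each machine's history.
assert pkgmgr_upgrade(machine_a, v2) != pkgmgr_upgrade(machine_b, v2)

# With a fresh checkout both machines converge on exactly the new commit.
assert image_upgrade(machine_a, v2) == image_upgrade(machine_b, v2) == v2
```

The point of the second function is that its first argument is unused: the outcome is a function of the new commit only.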
With rpm-ostree being a hybrid system, you can choose to engage package layering (or overrides). But the system very clearly highlights that list; note a major simplification is combining the "packages you installed" and "user installed" lists. The rpm-ostree model is very simple: you have a "base commit/image" and your extensions. For example:
$ rpm-ostree status -b
State: idle
BootedDeployment:
● ostree://fedora/32/x86_64/silverblue
Version: 32.2 (2020-08-22T17:28:53Z)
BaseCommit: 080312021f34c7763089ff12fcd2964647e0f55ac3981f869b56d232a33990f6
LayeredPackages: fish libvirt tmux virt-manager
An important but subtle detail in achieving this simplification: by default, rpm-ostree doesn’t allow marking a base package as user installed. Generally the idea is that removing user-interesting packages from the base image is something you shouldn’t do.
rpm-ostree goes to some lengths internally to make this split happen; the libdnf/rpm layers don’t have any model of "base image" because everything’s a package to them.
Has anti-hysteresis properties
I know "has anti-hysteresis properties" is an awkward phrase (and I’m happy to hear alternatives) but I think hysteresis is a great term that we should start using in computing. Today it seems to mostly be used in the sciences but I propose adopting it – this in the spirit of making computer science more like a real science.
Let’s take a look specifically at elastic hysteresis because it’s easy to understand and even try at home.
Basically, a rubber band has "hysteresis" ("hidden state"/"memory") which comes from how much it was stretched in the past. And this state is basically impossible to see by just looking at the rubber band. For a related example with rubber, see the two balloon experiment.
To tie together the previous section on package managers with this:
Systems managed by traditional package managers (apt/yum/etc) have a lot of effective hysteresis. I think even many experienced system administrators would have trouble confidently and precisely explaining how the multiple things listed above (the container or IaaS base image, package manager user installed database, etc.) all interact in forming the final state of the system over time as in-place upgrades are applied.
Configuration management systems and hysteresis
This "hysteresis" problem occurs not just in package managers but also many configuration management systems (puppet/ansible/etc).
A simple example I’ve seen happen is where the system administrator writes a playbook (or equivalent) that does e.g.:
- name: Allow nopasswd for wheel
  lineinfile:
    path: /etc/sudoers
    state: present
    regexp: '^%wheel ALL='
    line: '%wheel ALL=(ALL) NOPASSWD: ALL'
Then later, say the organization wants to change to use a separate group instead of wheel, say admin or whatever.
If the playbook is changed in git to do:
- group:
    name: admin
    state: present
- name: Allow nopasswd for admins
  lineinfile:
    path: /etc/sudoers
    state: present
    regexp: '^%admin ALL='
    line: '%admin ALL=(ALL) NOPASSWD: ALL'
The previous change to modify wheel in /etc/sudoers will silently persist (until the system is reprovisioned). And that could become a security problem even in this case.
In most of these configuration management systems, in some cases the admin may need to explicitly add a change which reverts a prior change, and then makes the new change. But not all of the time – some (most) changes don’t need this.
It’s an easy mistake to make when writing effectively arbitrary code to change files in persistent filesystems.
Hence, configuration management systems are subject to hysteresis too, and I think many of them could do better in warning users about this, and pushing for better practices. For example, the playbook would be more "anti-hysteresis" if it wrote to /etc/sudoers.d/mycustom (sudo skips files in that directory whose names contain a dot), which gets replaced entirely, though /etc/sudoers.d is only supported by relatively modern sudo I think.
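Here is a small Python sketch of why the line-editing approach leaves state behind while whole-file replacement does not. The lineinfile function below is a simplified stand-in I wrote for illustration (replace the last matching line, otherwise append); it is not Ansible's implementation, and the file contents are assumptions.

```python
import re


def lineinfile(lines: list[str], regexp: str, line: str) -> list[str]:
    """Simplified model of a 'state: present' line edit: replace the last
    line matching regexp, or append the line if nothing matches."""
    pattern = re.compile(regexp)
    matches = [i for i, existing in enumerate(lines) if pattern.search(existing)]
    out = list(lines)
    if matches:
        out[matches[-1]] = line
    elif line not in out:
        out.append(line)
    return out


def write_dropin(directory: dict[str, str], name: str, content: str) -> dict[str, str]:
    """Whole-file replacement: the previous content of the file is discarded."""
    out = dict(directory)
    out[name] = content
    return out


# Line editing: run the first playbook, then the edited one.
sudoers = ["root ALL=(ALL) ALL"]
sudoers = lineinfile(sudoers, r"^%wheel ALL=", "%wheel ALL=(ALL) NOPASSWD: ALL")
sudoers = lineinfile(sudoers, r"^%admin ALL=", "%admin ALL=(ALL) NOPASSWD: ALL")

# The wheel rule is still present although no playbook mentions it any more.
assert "%wheel ALL=(ALL) NOPASSWD: ALL" in sudoers

# Whole-file replacement: same two steps against a drop-in file.
# (The name has no dot because sudo skips such files in sudoers.d.)
sudoers_d: dict[str, str] = {}
sudoers_d = write_dropin(sudoers_d, "mycustom", "%wheel ALL=(ALL) NOPASSWD: ALL\n")
sudoers_d = write_dropin(sudoers_d, "mycustom", "%admin ALL=(ALL) NOPASSWD: ALL\n")

# Nothing from the earlier version leaks through.
assert "wheel" not in sudoers_d["mycustom"]
```

In the first case the final file depends on every playbook version that ever ran; in the second it depends only on the latest one.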
Kubernetes is fairly opinionated in having code in container images you pull (equivalent of /usr), and storing configuration in a configmap (which would get projected into environment variables or files in /etc). When you update a deployment, all state in the (sadly writable by default) pod container filesystem is thrown away, and there’s also no leakage from any previous version of a configmap. So we could say that the Kubernetes approach to applications has strong "anti-hysteresis properties".
The OpenShift Machine Config Operator defaults to anti-hysteresis
Tying together the Kubernetes and operating system threads: in OpenShift 4, the machine-config-operator allows you to write config files and systemd units into the operating system /etc by using kubectl/oc. (The original goal of etcd was in fact to do this, then Kubernetes happened and the focus shifted to that layer. In OpenShift 4 we are meeting that original goal of storing the Unix /etc in etcd via the MCO.)
The reason I claim the MCO has "anti-hysteresis" is it keeps track of the old and new system states reliably and is able to diff them. For example, if you write a config file for chrony to set the timeserver, then later kubectl delete machineconfig/my-chrony-config since you’re fine with the default, the MCO will notice that the old config wrote /etc/chrony.conf and the new one doesn’t, and it will correctly revert the file back to the default.
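The diffing idea can be sketched as follows. This is not the MCO's actual code, just a minimal Python model of "compare old and new desired state, and undo what the old state did": the function name, the data shapes, and the chrony file contents are all assumptions made for the example.

```python
from typing import Optional


def reconcile(
    old: dict[str, str],
    new: dict[str, str],
    defaults: dict[str, str],
) -> dict[str, Optional[str]]:
    """Return the file operations needed to move a node from `old` to `new`.

    Keys are paths; a value is the content to write, or None to delete the
    file. `defaults` holds the content shipped by the OS (hypothetical here).
    """
    actions: dict[str, Optional[str]] = {}
    # Files that are new or whose content changed.
    for path, content in new.items():
        if old.get(path) != content:
            actions[path] = content
    # Files the old config managed but the new one does not: undo them.
    for path in old.keys() - new.keys():
        actions[path] = defaults.get(path)  # OS default, or None to delete
    return actions


defaults = {"/etc/chrony.conf": "pool pool.example.org iburst\n"}
old = {"/etc/chrony.conf": "server time.example.com iburst\n"}
new: dict[str, str] = {}  # the custom machineconfig was deleted

# The custom file is reverted to the OS default rather than left behind.
assert reconcile(old, new, defaults) == {
    "/etc/chrony.conf": "pool pool.example.org iburst\n"
}
```

The key design point is that the old desired state is an explicit input, which is exactly what the lineinfile style of change lacks.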
Just like OSTree has a checksum describing the state of /usr, the MCO maintains a checksum for its state and when you look at a node, you can say its configuration is e.g. rendered-master-<checksum>. If a system can describe its state with a checksum, that implies it has strong anti-hysteresis properties.
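As a rough illustration of "state described by a checksum", the Python sketch below derives a name from the full desired state. The canonicalization (sorted JSON) and the truncated SHA-256 are my own choices for the example and are not how OSTree or the MCO compute their checksums.

```python
import hashlib
import json


def rendered_name(role: str, files: dict[str, str]) -> str:
    """Derive a name from the complete desired state. The same content
    always yields the same name, regardless of how it was assembled."""
    canonical = json.dumps(files, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(canonical).hexdigest()
    return f"rendered-{role}-{digest[:12]}"


# Two configs built in a different order but with identical content.
a = {"/etc/chrony.conf": "server time.example.com\n", "/etc/motd": "hello\n"}
b = {"/etc/motd": "hello\n", "/etc/chrony.conf": "server time.example.com\n"}
assert rendered_name("master", a) == rendered_name("master", b)

# Any change in content produces a different name.
c = dict(a, **{"/etc/motd": "goodbye\n"})
assert rendered_name("master", a) != rendered_name("master", c)
```

If two nodes report the same name, they were given the same inputs; there is no room in the name for "how we got here".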
Now, there are holes in this model. If for example instead of writing a file directly, you create a systemd unit which does e.g. ExecStart=/bin/sh -c 'echo somedata > /etc/someotherfile', and you later delete that unit – the file will persist. The reason why relates to this FAQ.
A general pattern here is that any place you have arbitrary code that changes over time writing to persistent files, you’re at risk of hysteresis (or "unmanaged state").
Reprovisionable
OK, so systems with anti-hysteresis properties are good. But in practice, I think there’s always going to be that small amount of "unmanaged state" that sneaks in even in organizations with strong discipline. For example, a system administrator trying to debug one node and using ssh to edit a file directly to increase the debug level of a service, and then later that causes a problem by flooding the log system or causing more I/O to the local filesystem and increasing latency for other services.
And this problem isn’t just at the operating system layer; at the IaaS/CaaS layer it’s easy to have VMs or containers that were created manually to debug something and then "leak" unless actively removed.
In an IaaS deployment there are a wide variety of objects in general (storage buckets, SaaS etc.) and equally many tools to deal with leaks at that level; usually this boils down to a "resource tagging" approach. (One thing I think is nice about GCP over e.g. AWS is the "project" approach, specifically this bit: "This model can also be useful for testing purposes: once you’re done with a project, you can delete the project, and all of the resources created by that project will be deleted as well.")
At the operating system level (and at the IaaS level if you can too), I think a good way to deal with this is to periodically reprovision, e.g. once a month (if you can do it faster, great) on a rolling basis. In OpenShift 4 for example with the machine-api-operator that would just be a small amount of code (a custom controller running as a pod) to periodically kubectl delete machine/<somemachine> based on whatever criteria you want – the platform will handle the rest, spinning up a new one to take its place. Currently this only applies to workers but I hope we can cover the control plane in future releases. A neat thing about this is that the IaaS-layer objects (virtual machines) are just Kubernetes custom resources that are managed via the cluster.
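The selection logic for such a controller could be as small as the Python sketch below. It only decides which machine is due; it does not talk to a cluster. The function name, the 30-day default, and the machine names are assumptions for illustration, and the caller would do the actual kubectl delete machine/<name>.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def next_machine_to_reprovision(
    created: dict[str, datetime],
    now: datetime,
    max_age: timedelta = timedelta(days=30),
) -> Optional[str]:
    """Pick the oldest machine older than max_age, or None if none is due.

    Returning at most one name per call keeps the replacement rolling
    instead of deleting every expired machine at once.
    """
    expired = {name: ts for name, ts in created.items() if now - ts > max_age}
    if not expired:
        return None
    return min(expired, key=lambda name: expired[name])


now = datetime(2020, 9, 1, tzinfo=timezone.utc)
machines = {
    "worker-a": now - timedelta(days=40),
    "worker-b": now - timedelta(days=10),
    "worker-c": now - timedelta(days=60),
}
print(next_machine_to_reprovision(machines, now))  # worker-c
```

A real controller would also wait for the replacement to become ready before picking the next machine; that part is left out here.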
Conclusion: We want reprovisionable, anti-hysteresis systems
There are a whole lot of current terms for what I’ve covered above: "gitops", "managed configuration", "cattle", "stateless", "immutable infrastructure" etc. I’m suggesting the goal is: reprovisionable infrastructure with anti-hysteresis properties. But I’d also be happy if we used "reprovisionable" instead of "cattle", and also if we introduced the term "anti-hysteresis" instead of "immutable" (where applicable).