
Challenges of Postgres Containers

Many enterprise workloads are being migrated from commercial databases like Oracle and SQL Server to Postgres, which brings anxiety and challenges for mature operational teams. Learning a new database like Postgres sounds intimidating. In practice, most of the concepts directly transfer from databases like SQL Server and Oracle. Transactions, SQL syntax, explain plans, connection management, redo (aka transaction/write-ahead logging), backup and recovery – all have direct parallels. The two biggest differences in Postgres are: (1) vacuum and (2) the whole “open source” and decentralized development paradigm… once you learn those, the rest is gravy. Get a commercial support contract if you need to, try out some training; there are several companies offering these. Re-kindle the curiosity that got us into databases originally, take your time learning day-by-day, connect with other Postgres people online where you can ask questions, and you’ll be fine!

Nonetheless: the anxiety is compounded when you’re learning two new things: both Postgres and containers. I pivoted to Postgres in 2017, and I’m learning containers now. (I know I’m 10 years late getting off the sidelines and into the containers game, but I was doing lots of other interesting things!)

Postgres was already one of the most-pulled images on Docker Hub back in 2019 (10M+ pulls) and unsurprisingly it continues to be among the most-pulled images today (1B+ pulls). Local development and testing with Postgres has never been easier. For many developers, docker run -e POSTGRES_PASSWORD=mysecret postgres has replaced installers, package managers and desktop GUIs in their local dev & test workflows. (Note that flags must come before the image name; anything after it is passed to the container as arguments.)
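A concrete sketch of that local workflow (assumes Docker is installed; the image tag and the container name pg-scratch are illustrative, and the script skips quietly when Docker isn't available):

```shell
# Disposable local Postgres for dev & test. Skips cleanly if docker is absent.
command -v docker >/dev/null 2>&1 || { echo "docker not found, skipping"; exit 0; }

# Start a throwaway instance: --rm removes it on stop, -d runs it detached.
docker run --rm -d --name pg-scratch \
  -e POSTGRES_PASSWORD=mysecret \
  postgres:17 || exit 0

# Wait for the server to accept connections (initdb runs a temporary server
# first, so also pause briefly before querying).
for i in 1 2 3 4 5 6 7 8 9 10; do
  docker exec pg-scratch pg_isready -U postgres >/dev/null 2>&1 && break
  sleep 2
done
sleep 2

# Run a query using the psql client that ships inside the same image --
# no local client install needed.
docker exec pg-scratch psql -U postgres -c 'SELECT version();'

# Tear down; --rm deletes the container and its data.
docker stop pg-scratch >/dev/null
```

Because everything (server and client) lives inside the container, nothing is installed on the host and cleanup is a single `docker stop`.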

With the widespread adoption of kubernetes, the maturing of its support for stateful workloads, and the growing availability of Postgres operators – containers are increasingly being used throughout the full lifecycle of the database. They aren’t just for dev & test: they’re for production too.

Containers will dominate the future of Postgres, if only because I bear the scars of managing 15-year-old servers where the package manager database never matched reality and there were 20 different copies of python and 30 different copies of java installed under various root and user directories.

But what exactly is a container? What is inside that thing? In fact, a lot more than I first thought. Six months ago I was convinced there’s no possible way glibc was in that container. You can’t just take a glibc from 2024 and run it on a kernel from 2016. Right?

$ docker run --interactive --tty debian:bookworm-slim

root@27234bdf966e:/# dpkg -l libc6
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name           Version        Architecture Description
+++-==============-==============-============-=================================
ii  libc6:amd64    2.36-9+deb12u9 amd64        GNU C Library: Shared libraries

root@27234bdf966e:/# dpkg -l|grep "  lib"|wc -l
51

root@27234bdf966e:/# dpkg -l|grep -v "  lib"|wc -l
42

root@27234bdf966e:/# exit

Basically, there can be a whole operating system inside that container! (Minus the kernel.) In practice there are a range of “base operating systems” from hyper-slim alpine (a la busybox) to containers that run a full copy of systemd and provide a full operating system experience. Docker’s official Debian-based Postgres containers use the “slim” debian OS container as a base (88 packages and 74MB) and are customized with additional packages from PGDG and the Debian universe (total 146 packages and 434MB).
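If you want to see that size gradient yourself, something like this works (assumes Docker and network access; exact sizes drift as images are rebuilt, and the script skips images it cannot pull):

```shell
# Compare on-disk sizes: hyper-slim alpine, slim debian, full postgres image.
command -v docker >/dev/null 2>&1 || { echo "docker not found, skipping"; exit 0; }

for img in alpine:3 debian:bookworm-slim postgres:17-bookworm; do
  docker pull --quiet "$img" >/dev/null 2>&1 || continue
  docker image inspect --format '{{.Size}}' "$img" |
    awk -v img="$img" '{ printf "%-24s %7.1f MB\n", img, $1/1024/1024 }'
done
```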

The glibc-to-kernel cross-compatibility is magic to me. It's not by chance. Libraries like glibc are pretty tightly coupled to the kernel, and it takes an intentional effort by both Linux kernel maintainers and glibc maintainers to keep them cross-compatible. It's like the intentional effort by Postgres maintainers to maintain ABI compatibility across Postgres "minor release" bugfix versions.

Combined with a good kubernetes operator, Postgres containers are production ready today.

But containers have a few rough edges. It’s important to know about them if you’re going to move toward production operations with Postgres containers.

Security and Isolation

Containers are secure enough for the vast majority of companies and use cases. The underlying technology is well maintained, new vulnerabilities are addressed promptly, fixes are made available quickly, and designs are thoroughly reviewed. Kernel isolation capabilities have been tested by world-class pen testers and red teams.

However there is a meaningful difference between kernel isolation and hardware VM-based isolation. The Firecracker paper presented at USENIX NSDI 2020 is the best writeup that I've seen on the topic so far.

Fundamentally it’s about attack surface area within the boundary between unprivileged and privileged execution. At the end of the day, a general-purpose operating system kernel’s syscall interface is composed of hundreds of critical functions with complex implementations. Virtual Machine Monitors (VMMs) and processor instruction sets are comparatively simpler with better-understood abstractions. Virtualization is not immune from attack – recent incidents like Meltdown and Spectre and other side-channel/speculative-execution attacks have proven the point – but reducing attack surface area is fundamental in very-high-security environments.

The vast majority of companies should not be disabling SMT on their processors or avoiding containers. There is sometimes a trade-off between security and cost/performance. Hyperscalers and SaaS companies have use cases where they have to opt for virtualization even when it's less cost-effective.

Most readers here can deploy with traditional containers. Just understand the reasoning and be cognizant of the choice.

Host/Node Operating System Compatibility

You can’t just take a glibc from 2024 and run it on a kernel from 2016. Right?

The answer is actually a little more nuanced than you’d expect.
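One way to see both sides of that kernel/userland contract on any Linux box (the libc paths below vary by distro and architecture, so this checks a few common ones):

```shell
# The kernel side of the contract:
uname -r

# The userland side: the glibc version this system's binaries link against.
command -v ldd >/dev/null 2>&1 || { echo "ldd not found, skipping"; exit 0; }
ldd --version | head -n1

# glibc itself is executable and prints its build details, including the
# oldest kernel it was built to support (when configured with a minimum).
for libc in /lib/x86_64-linux-gnu/libc.so.6 /lib64/libc.so.6 /usr/lib/libc.so.6; do
  [ -x "$libc" ] && { "$libc" | grep -i 'minimum supported kernel' || true; break; }
done
```

If the "minimum supported kernel" reported by a container's glibc is newer than the host's `uname -r`, that container is outside the compatibility window.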

The terminology used by Scott McCarty on the Red Hat blog around 2019-2020 is portability, compatibility and supportability. Scott’s a product manager who is particularly concerned about commercial contracts between Red Hat and its customers. The term supportability is an explicit reference to “scope of what Red Hat fully <contractually?> commits to debug and fix, as part of what you are paying for”.

But I think the terms are helpful even for people who just want to run their business on containers and are not Red Hat customers.

The standardized file format of containers makes them highly portable across systems and software. OCI-compliant containers can be copied and understood by tooling anywhere, but that doesn’t mean they can run anywhere.

Compatibility is about where containers will run. Naturally, if you compile for ARM then it's not going to run on x86. I don't yet fully understand how compatibility works across operating systems (Linux, macOS and Windows)… I think there has been some clever engineering in recent years to create more compatibility here than what used to exist.

Things seem to get more interesting across different versions of the linux kernel.

Internet forums are full of people pointing out that the Linux kernel APIs are decades old and change rarely so your containers will probably run fine on any Linux. But you also don’t need to go far to find examples of things that break, like centos:6 bash crashing on Ubuntu 18.04 or useradd failing when the host is upgraded to RHEL 7 (and continuing to work fine on RHEL 6).

Even if you don’t have any intentions of becoming a Red Hat customer, I think it’s informative to read their official container support policy and their official container compatibility matrix. In particular: their “workload-specific” guidelines for container compatibility:

  • Run as an unprivileged container (i.e. don't pass the --privileged flag)
  • Do not interact directly with kernel-version-specific data structures (ioctl, /proc, /sys, routing, iptables, nftables, eBPF, etc) or kernel-version-specific modules (KVM, OVS, SystemTap, etc.)

That’s good advice. It should mostly keep you out of trouble. Don’t forget it’s not just about your code, but also about your dependencies – even debian packages & binaries you pull into your container.

I think Postgres and most Postgres extensions should be fairly safe. They may not always strictly follow the rules above, but I think if any problems are found in core postgres (or a widely used extension) they’re likely to be taken seriously. The Postgres community generally tends to value portability & compatibility.

Red Hat generally recommends building containers with the same base OS major version as the host where they run. My own opinion is to stay “close”. Stick with major distro versions released within a few years of each other. Just my opinion – but I would probably look at distro release date over linux kernel version, given how aggressively kernels are sometimes patched by distros.

Container Versioning and Change Management

Lies. Containers don’t actually have versions. They have tags.

In Postgres, and for that matter any major linux distribution, if I ask for a specific version today – and then I ask for the same version next week – I will get the same bits. In fact, the official Debian Policy Manual section 3.2.2 codifies what I thought was common sense:

The part of the version number after the epoch must not be reused for a version of the package with different contents once the package has been accepted into the archive, even if the version of the package previously using that part of the version number is no longer present in any archive suites.

Containers don’t work like this at all. Practically every example on the internet makes it look like you can ask for a specific version of Postgres with docker run postgres:17.2 – but it turns out that 17.2 is just an arbitrary tag and not really a version number.

The docs are clear that it’s just a tag, but it’s all very confusing to newcomers – and there are dangers lurking here with Postgres.
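If you do want immutable deployments, you can resolve a tag to its content-addressed digest and deploy by digest instead (assumes Docker and network access; the digest printed is simply whatever the tag points to on the day you run this):

```shell
command -v docker >/dev/null 2>&1 || { echo "docker not found, skipping"; exit 0; }

# Resolve what the mutable tag currently points to.
docker pull --quiet postgres:17-bookworm >/dev/null 2>&1 || { echo "pull failed, skipping"; exit 0; }
digest=$(docker inspect --format '{{index .RepoDigests 0}}' postgres:17-bookworm)
echo "$digest"

# Running by digest pins the exact bits, regardless of where the tag moves later:
#   docker run -e POSTGRES_PASSWORD=mysecret "$digest"
```

The trade-off, as the maintainers point out, is that a pinned digest never picks up security or critical updates on its own.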

The biggest danger is around the now-infamous glibc collation problems.

As early as 2017, a user of containerized Postgres 9.5 switched from tag 9.5 to tag 9.5-alpine and their data seemed to disappear. I suspect this was likely related to collation.

Debian v10/Buster was released in 2019 (with the big scary glibc change), and the Docker community hit the brakes on updating their images due to the known problems. Finally in 2021 they caved and added a bunch of complexity to their build scripts, in order to start building for two major Debian versions at once. And thus were born the tags 10-stretch and 10-buster. The community instituted a policy of supporting only the two most recent major versions of Debian (stable and oldstable). The "default" tags where no OS is specified (e.g. 17 or 17.2) change which major OS they point to. This has resulted in a steady stream of problem reports every time a new Debian major version is released.

Debian v11/Bullseye was released Aug 14, 2021. On Nov 10 a GitHub issue was opened by a user seeing incorrect sort order in Russian. Debian v12/Bookworm was released on June 10, 2023. On June 15 a GitHub issue was opened by a user getting collation version mismatch warnings, and the torture-test scan indicates this jump (glibc 2.31 to 2.36) likely includes changes in the Oriya and Kurdish languages (in 2.32). I haven't yet checked whether ICU changed in bookworm.

The takeaway is summarized well in the GitHub Issue:

It is possible to completely avoid surprise changes when deploying containers: image digests can be used instead of image tags. But I think in most cases, using the tags as described above is the best solution. The tag postgres:15-bookworm is locked onto the Debian and Postgres stable releases, so you’ll automatically get security and critical updates by using a tag like this. Just make sure to include the operating system part!

And remember that you can't just switch the tag to a new operating system version unless you want to risk corruption. If you want to be 100% safe then you need to either (a) logically pg_dump-and-load, or use logical replication, to move your data to the new operating system container image, or (b) set your default provider to the new pg17 builtin C collation, use linguistic collation at a query or table level only when needed, and rebuild /all/ dependent objects on OS changes.
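One way to check whether an OS change has already bitten you: Postgres records the collation version in use, and pg_collation_actual_version() reports what the OS currently provides. A sketch (assumes psql can reach a running server with default connection settings; skips otherwise):

```shell
command -v psql >/dev/null 2>&1 || { echo "psql not found, skipping"; exit 0; }
command -v pg_isready >/dev/null 2>&1 || { echo "pg_isready not found, skipping"; exit 0; }
pg_isready -q 2>/dev/null || { echo "no reachable server, skipping"; exit 0; }
psql -X -c "SELECT 1" >/dev/null 2>&1 || { echo "cannot connect, skipping"; exit 0; }

# Collations whose recorded version differs from what the OS now provides.
# Any rows here mean dependent indexes and constraints may need rebuilding.
psql -X -c "
  SELECT collname, collversion, pg_collation_actual_version(oid)
  FROM pg_collation
  WHERE collversion IS NOT NULL
    AND collversion <> pg_collation_actual_version(oid);"
```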

Memory Management

https://github.com/kubernetes/kubernetes/issues/43916 has now been open for 7 years and has 141 comments. Still going strong this month.

A few folks referenced https://github.com/linchpiner/cgroup-memory-manager as a workaround, but I’m not sure whether I’d use this with Postgres… at present I think the safest option with postgres remains the request==limit configuration.
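To make that concrete, here is a minimal sketch of what request==limit looks like in a pod spec (the pod name, image tag and sizes are illustrative only, not a recommendation):

```yaml
# Setting requests equal to limits for every resource in every container
# gives the pod the Guaranteed QoS class: the scheduler reserves the full
# amount and the pod is last in line for eviction under memory pressure.
apiVersion: v1
kind: Pod
metadata:
  name: pg-example
spec:
  containers:
    - name: postgres
      image: postgres:17-bookworm
      resources:
        requests:
          memory: "4Gi"
          cpu: "2"
        limits:
          memory: "4Gi"
          cpu: "2"
```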

An engineer from Bucharest named Mihai Albert wrote a very interesting blog post a few years ago that digs into the details of this behavior. I think his blog might be based on cgroups v1. I hadn't seen it before, but it's referenced from that GitHub issue. https://mihai-albert.com/2022/02/13/out-of-memory-oom-in-kubernetes-part-4-pod-evictions-oom-scenarios-and-flows-leading-to-them

I took my first swing at consolidating, organizing and writing what I learned back in September: Kubernetes Requests and Limits for Postgres … still a lot I don’t know or haven’t dug into yet.

Overall, I can’t quite tell whether there are any actual kubernetes code improvements on the horizon yet. We might still be in the “digging and discussing” stage.

Summary

Combined with a good kubernetes operator, Postgres containers are production ready today. I’m not sure whether I’d go learn kubernetes just to run Postgres but the reality is that kubernetes is already in use at many companies for application workloads. If kubernetes is already being deployed, then learning and leveraging it for Postgres makes sense.

2024 has been an exciting year for me and I’m very happy for the opportunity to begin really digging into Postgres containers. My four main concerns are outlined here, and they aren’t dampening my enthusiasm.

TLDR (I’m relatively new to containers, so hopefully I’m getting these right):

  1. don’t be intimidated to learn postgres if you’re a DBA for another database – it’s easier than you think
  2. set your database default collation to the new v17 builtin C collation, and use ICU at a table or query level in cases where you need linguistic collation. rebuild only those objects on OS changes.
  3. include the OS in your container “version” tags when deploying postgres containers
  4. deploy on a host/node with an OS major version released within a few years of your base container image (if the same OS major version, then all the better)
  5. for now, stick with request==limit for kubernetes memory allocations.

Postgres containers solve problems that have haunted us for decades, and they are here to stay.

About Jeremy

Building and running reliable data platforms that scale and perform. about.me/jeremy_schneider


Disclaimer

This is my personal website. The views expressed here are mine alone and may not reflect the views of my employer.

contact: 312-725-9249 or schneider @ ardentperf.com

